Re: data corruption - the nightmare continues

"Mike Black" <mblack@csihq.com> · Wed, 20 Mar 2002 06:56:39 -0500

I see the same behavior (you didn't say what error messages were in your
log): a random disk pops out of the array every once in a while.  Mine's
fibre channel, though, with a QLogic controller.  I don't even have to reboot
anymore; I just remove the disk and re-add it to the raid set.
Here's the latest:
Mar 15 10:20:49 yeti kernel: SCSI disk error : host 5 channel 0 id 0 lun 0 return code = 28000002
Mar 15 10:20:49 yeti kernel: Current sd41:01: sense key Hardware Error
Mar 15 10:20:49 yeti kernel: Additional sense indicates Internal target failure
Mar 15 10:20:49 yeti kernel:  I/O error: dev 41:01, sector 8696056
Mar 15 10:20:49 yeti kernel: raid5: Disk failure on sdq1, disabling device. Operation continuing on 6 devices
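
The log above names the same partition two ways: "dev 41:01" in the I/O
error line and "sdq1" in the raid5 line. A minimal Python sketch of that
mapping follows. It assumes the kernel prints major:minor in hex and that
SCSI disks use the usual Linux numbering (major 8 for the first 16 disks,
majors 65 to 71 for the next blocks of 16, and 16 minors per disk). The
function name sd_name is hypothetical, not something from the thread.

```python
# Assumed Linux SCSI disk majors: 8 covers disks 0-15, 65-71 cover the rest.
SD_MAJORS = [8, 65, 66, 67, 68, 69, 70, 71]


def sd_name(dev: str) -> str:
    """Translate a kernel 'MM:mm' hex device pair into an sdXN style name."""
    major_hex, minor_hex = dev.split(":")
    major, minor = int(major_hex, 16), int(minor_hex, 16)
    if major not in SD_MAJORS:
        raise ValueError(f"major {major} is not a SCSI disk major")

    # Each disk owns 16 minors: minor // 16 picks the disk, minor % 16 the partition.
    disk = SD_MAJORS.index(major) * 16 + minor // 16
    part = minor % 16

    # Disk suffixes run a..z, then aa, ab, ... (bijective base 26).
    letters = ""
    n = disk
    while True:
        letters = chr(ord("a") + n % 26) + letters
        n = n // 26 - 1
        if n < 0:
            break

    # Partition 0 means the whole disk, so no number is appended.
    return f"sd{letters}{part if part else ''}"


if __name__ == "__main__":
    print(sd_name("41:01"))  # device pair from the I/O error line above
```

Under these assumptions "41:01" is major 65, minor 1, which is partition 1
of the 17th SCSI disk and matches the sdq1 that raid5 reports as failed.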

Here's my history (I just added two more SCSI cards, so this set has moved
from host 3 to host 5 in the logs).  Also, I should mention that I don't see
these problems on my fibre-channel set.
Mar 15 10:20:49 yeti kernel: SCSI disk error : host 5 channel 0 id 0 lun 0 return code = 28000002
Nov 24 20:04:28 medusa kernel: SCSI disk error : host 2 channel 0 id 10 lun 0 return code = 10000
Nov  5 09:01:15 yeti kernel: SCSI disk error : host 3 channel 0 id 6 lun 0 return code = 28000002
Aug 14 13:47:57 yeti kernel: SCSI disk error : host 3 channel 0 id 1 lun 0 return code = 28000002
Aug  4 17:17:00 yeti kernel: SCSI disk error : host 3 channel 0 id 0 lun 0 return code = 28000002
Jul 29 08:09:29 yeti kernel: SCSI disk error : host 3 channel 0 id 4 lun 0 return code = 28000002
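
The history above is easier to read once it is tallied per machine and
SCSI id, with each return code split into its bytes. The Python sketch
below does that. It assumes the kernel prints the return code in hex and
that the 32-bit value is laid out as driver, host, message and status
bytes from high to low, as in the Linux SCSI mid-layer headers. The names
HISTORY, PATTERN and split_code are hypothetical.

```python
import re
from collections import Counter

# The six history lines from the message, with wrapped lines rejoined.
HISTORY = """\
Mar 15 10:20:49 yeti kernel: SCSI disk error : host 5 channel 0 id 0 lun 0 return code = 28000002
Nov 24 20:04:28 medusa kernel: SCSI disk error : host 2 channel 0 id 10 lun 0 return code = 10000
Nov  5 09:01:15 yeti kernel: SCSI disk error : host 3 channel 0 id 6 lun 0 return code = 28000002
Aug 14 13:47:57 yeti kernel: SCSI disk error : host 3 channel 0 id 1 lun 0 return code = 28000002
Aug  4 17:17:00 yeti kernel: SCSI disk error : host 3 channel 0 id 0 lun 0 return code = 28000002
Jul 29 08:09:29 yeti kernel: SCSI disk error : host 3 channel 0 id 4 lun 0 return code = 28000002
"""

PATTERN = re.compile(
    r"(?P<machine>\S+) kernel: SCSI disk error : host (?P<host>\d+) "
    r"channel (?P<channel>\d+) id (?P<id>\d+) lun (?P<lun>\d+) "
    r"return code = (?P<code>[0-9a-fA-F]+)"
)


def split_code(code_hex: str) -> dict:
    """Split a hex return code into its four bytes (assumed layout)."""
    value = int(code_hex, 16)
    return {
        "driver": (value >> 24) & 0xFF,
        "host": (value >> 16) & 0xFF,
        "msg": (value >> 8) & 0xFF,
        "status": value & 0xFF,
    }


events = [m.groupdict() for m in PATTERN.finditer(HISTORY)]
by_code = Counter(e["code"] for e in events)
by_target = Counter((e["machine"], int(e["id"])) for e in events)

if __name__ == "__main__":
    for code, count in by_code.items():
        print(code, count, split_code(code))
    for target, count in sorted(by_target.items()):
        print(target, count)
```

Read this way, five of the six errors carry the same code, whose status
byte is 0x02 (CHECK CONDITION in SCSI terms, consistent with the sense
data in the log), and they are spread over several ids on yeti. The
single error on medusa has a nonzero host byte and zero status, so under
this layout it was flagged at the host adapter level instead.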

________________________________________
Michael D. Black   Principal Engineer
mblack@csihq.com  321-676-2923,x203
http://www.csihq.com  Computer Science Innovations
http://www.csihq.com/~mike  My home page
FAX 321-676-2355
----- Original Message -----
From: "Justin" <jb@dslreports.com>
To: <linux-raid@vger.kernel.org>
Sent: Tuesday, March 19, 2002 6:18 PM
Subject: Re: data corruption - the nightmare continues

FWIW, I get the same thing.

Some of my raid1 arrays tend to become U_ after a few months
of light use. Rebooting the box makes the device addressable
again, and the disk is not, in fact, bad.

I can do a complete dd to the "bad" disk without error, then
raidhotadd it back in again as well. After a few more months
of uptime, it is U_ again.
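
The U_ state described above is the member map that /proc/mdstat prints
for each array, so a dropped disk can be spotted without waiting for it
to be noticed by hand. The Python sketch below scans mdstat-style text
for any array whose map contains an underscore. The sample text and the
function name degraded_arrays are hypothetical and only illustrate the
format; they are not output from any of the boxes in this thread.

```python
import re

# Hypothetical /proc/mdstat contents, for illustration only.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
read_ahead 1024 sectors
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]
md1 : active raid1 sda2[0]
      8385856 blocks [2/1] [U_]
unused devices: <none>
"""


def degraded_arrays(mdstat_text: str) -> list:
    """Return (array, member_map) pairs for arrays with a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        # An array section starts with a line like "md0 : active raid1 ...".
        header = re.match(r"^(md\d+) :", line)
        if header:
            current = header.group(1)
            continue
        # The status line carries "[total/active] [UU...]"; "_" marks a missing disk.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current and "_" in status.group(3):
            degraded.append((current, status.group(3)))
    return degraded


if __name__ == "__main__":
    for name, member_map in degraded_arrays(SAMPLE_MDSTAT):
        print(f"{name} is degraded: [{member_map}]")
```

On a live box the same function could be fed the contents of
/proc/mdstat, for example from a cron job that mails the result.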

On an example box where this happens, the kernel is SMP 2.4.2,
the controller is an onboard Adaptec 7896, and the driver is aic7xxx.
The disks are Ultra LVD; the cables and disk mounts are all by
Intel, so I do not suspect a termination or cabling issue.
The motherboard is a 440GX.

I am curious to see whether my other boxes which are 2.4.18
SMP will be more stable.
-Justin

On Tue, Mar 19, 2002 at 11:58:46PM +0100, Marcel wrote:
> Rainer Fuegenstein wrote:
> >
> > Additional sense indicates Unrecovered read error
> >  I/O error: dev 08:19, sector 12850360
>
> <big snip>
>
> You should enable verbose SCSI error reporting in the kernel. It's a
> compile time kernel option. This will tell you more about what's going
> on in the disk subsystem.
>
> The above error message is not enough and if it's all you get, even with
> verbose error reporting enabled, you should talk to people more familiar
> with the SCSI drivers. Meanwhile double-check whether SCSI bus
> termination is done "by the book". Failure to do so can also cause some
> nasty intermittent problems.
>
> Marcel
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html