Re: 2.6.24.3: regular sata drive resets - worrisome?

Hein-Pieter van Braam <hp@xxxxxx> · Sun, 30 Mar 2008 23:02:15 +0200

On Sun, 2008-03-30 at 07:41 -0500, Roger Heflin wrote:
> Hans-Peter Jansen wrote:
> > On Sunday, 30 March 2008, Tejun Heo wrote:
> >> Hello,
> >>
> >> Hans-Peter Jansen wrote:
> >>>>>> Should I be worried? smartd doesn't show anything suspicious on
> >>>>>> those.
> >>>> Can you please post the result of "smartctl -a /dev/sdX"?
> >>> Here's the last smart report from two of the offending drives. As noted
> >>> before, I did the hardware reorganization, replaced the dog slow 3ware
> >>> 9500S-8 and the SiI 3124 with a single Areca 1130 and retired the
> >>> drives for now, but a nephew already showed interest. What do you
> >>> think, can I cede those drives with a clear conscience? The
> >>> Hardware_ECC_Recovered values are really worrisome, aren't they?
> >> Different vendors use different scales for the raw values.  The value is
> >> still pegged at the highest so it could be those raw values are okay or
> >> that the vendor just doesn't update value field accordingly.  My P120
> >> says 0 for the raw value and 904635 for hardware ECC recovered so there
> >> is some difference.  What do other non-failing drives say about those
> >> values?
> > 
> > The only non-failing drive was sdf, as it was running in standby mode in this
> > md RAID 5 array:
> > 
> > 20080323-011337-sdc.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       162956700
> > 20080323-011337-sdc.log:196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
> > 20080323-011337-sdc.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
> > 20080323-011337-sdc.log:198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
> > 20080323-011337-sdc.log:199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> > 20080323-011338-sdd.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       162520674
> > 20080323-011338-sdd.log:196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
> > 20080323-011338-sdd.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
> > 20080323-011338-sdd.log:198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
> > 20080323-011338-sdd.log:199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> > 20080323-011338-sde.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       148429049
> > 20080323-011338-sde.log:196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
> > 20080323-011338-sde.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
> > 20080323-011338-sde.log:198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
> > 20080323-011338-sde.log:199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> > 20080323-011339-sdf.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       1559
> > 20080323-011339-sdf.log:196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
> > 20080323-011339-sdf.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
> > 20080323-011339-sdf.log:198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
> > 20080323-011339-sdf.log:199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> > 
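
To compare the Hardware_ECC_Recovered raw values across the drives, as asked above, here is a minimal Python sketch. It assumes the grep-style `<file>-<drive>.log:<id> <name> ...` layout shown in the report; the names `REPORT`, `LINE_RE` and `raw_values` are made up for this illustration, and only four of the quoted lines are copied in.

```python
import re

# Four of the grep-style lines from the report above, copied verbatim.
REPORT = """\
20080323-011337-sdc.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       162956700
20080323-011338-sdd.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       162520674
20080323-011338-sde.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       148429049
20080323-011339-sdf.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       1559
"""

# Columns: file:id, name, flag, value, worst, thresh, type, updated,
# when_failed, raw. The drive name is taken from the log file name.
LINE_RE = re.compile(
    r"^\S*-(?P<drive>sd[a-z]+)\.log:(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-f]+"
    r"\s+(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(?P<raw>\d+)\s*$"
)


def raw_values(report, attribute):
    """Map drive name -> raw value for one SMART attribute name."""
    result = {}
    for line in report.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group("name") == attribute:
            result[match.group("drive")] = int(match.group("raw"))
    return result


if __name__ == "__main__":
    ecc = raw_values(REPORT, "Hardware_ECC_Recovered")
    for drive, raw in sorted(ecc.items()):
        print(f"{drive}: {raw}")
```

This only lines the raw numbers up side by side. Since vendors scale raw values differently, it says nothing by itself about whether a given count is healthy.
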
> >> Hmmm... If the drive is failing FLUSHs, I would expect to see elevated
> >> reallocation counters and maybe some pending counts.  Aieee.. weird.
> > 
> > But there are no reallocations nor any pending sectors on any of them.
> > 
> >>>>>> In all, it was 4 Samsung drives hanging off a SATA SiI 3124:
> >>>> FLUSH_EXT timing out usually indicates that the drive is having
> >>>> problems writing out what it has in its cache to the media.  There was
> >>>> one case where FLUSH_EXT timeout was caused by the driver failing to
> >>>> switch controller back from NCQ mode before issuing FLUSH_EXT but that
> >>>> was on sata_nv.  There hasn't been any similar problem on sata_sil24.
> >>> Hmm, I didn't notice any data corruption, and if there was any, it
> >>> lives on in the copies in their new home..
> >> It should have appeared as read errors.  Maybe the drive successfully
> >                              ^^^^
> >                              write (I guess)
> >> wrote those sectors after 30+ secs timeout.
> > 
> > That would point to some driver issue, wouldn't it? Roger Heflin also
> > experienced similar behavior with that controller, which wasn't 
> > reproducible with another. 
> > 
> > I can offer to rebuild that md array in a test environment and give
> > you access to it, if you're interested.
> > 
> > Anyway, thanks for caring Tejun,
> > Pete
> > 
> 
> Here are the errors I get, though looking at it more closely, I don't appear to be
> getting the reset, just this error from time to time:
> 
> sd 9:0:0:0: [sde] 976773168 512-byte hardware sectors (500108 MB)
> sd 9:0:0:0: [sde] Write Protect is off
> sd 9:0:0:0: [sde] Mode Sense: 00 3a 00 00
> sd 9:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x280000 action 0x0
> ata8.00: BMDMA2 stat 0x687d8009
> ata8.00: cmd 25/00:80:a7:00:1d/00:01:1d:00:00/e0 tag 0 cdb 0x0 data 196608 in
>           res 51/04:8f:98:01:1d/00:00:1d:00:00/f0 Emask 0x1 (device error)
> ata8.00: configured for UDMA/100
> ata8: EH complete
> sd 7:0:0:0: [sdd] 976773168 512-byte hardware sectors (500108 MB)
> sd 7:0:0:0: [sdd] Write Protect is off
> sd 7:0:0:0: [sdd] Mode Sense: 00 3a 00 00
> sd 7:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> 
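
For anyone trying to read the `cmd`/`res` lines in the log above, here is a rough Python sketch that unpacks them. It assumes the usual libata layout of `command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device` (with status and error in the first two slots of a `res` line) and 512-byte sectors as reported by the drive. The function and table names are invented for this example, and the command table is deliberately tiny.

```python
# Only the two opcodes relevant to this thread; not a complete table.
ATA_COMMANDS = {0x25: "READ DMA EXT", 0xEA: "FLUSH CACHE EXT"}

# Classic ATA status and error register bit names.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
               0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x20: "MC", 0x10: "IDNF",
              0x08: "MCR", 0x04: "ABRT", 0x02: "TK0NF", 0x01: "AMNF"}


def flag_names(value, table):
    """Return the names of the bits set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if value & bit]


def parse_taskfile(text):
    """Split 'aa/bb:cc:dd:ee:ff/gg:hh:ii:jj:kk/ll' into named registers."""
    first, low, high, device = text.strip().split("/")
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    hob = [int(x, 16) for x in high.split(":")]
    _hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = hob
    return {
        "first": int(first, 16),    # command (cmd line) or status (res line)
        "second": feature,          # feature (cmd line) or error (res line)
        # 48-bit LBA: the HOB registers carry the upper three bytes.
        "lba": (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
               | (lbah << 16) | (lbam << 8) | lbal,
        # 16-bit sector count (the special case 0 = 65536 is ignored here).
        "sectors": (hob_nsect << 8) | nsect,
        "device": int(device, 16),
    }


if __name__ == "__main__":
    cmd = parse_taskfile("25/00:80:a7:00:1d/00:01:1d:00:00/e0")
    res = parse_taskfile("51/04:8f:98:01:1d/00:00:1d:00:00/f0")
    print(ATA_COMMANDS.get(cmd["first"], "unknown"),
          "lba", cmd["lba"], "bytes", cmd["sectors"] * 512)
    print("status", flag_names(res["first"], STATUS_BITS),
          "error", flag_names(res["second"], ERROR_BITS))
```

Read this way, the command is a 384-sector READ DMA EXT (384 x 512 = 196608 bytes, which matches "data 196608 in"), and the result has ERR set in the status register with ABRT in the error register. That says the drive aborted the read, not why it did so.
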
> I have 4 identical disks. With all 4 connected to the SiI controller, all of them give
> some errors; moving 2 of the disks to a Promise controller makes the errors go
> away on the 2 connected to the Promise controller.   All drives are part of a
> software RAID 5 array.
> 
> Startup looks like this:
> sata_sil 0000:00:09.0: version 2.3
> ACPI: PCI Interrupt 0000:00:09.0[A] -> GSI 16 (level, low) -> IRQ 20
> sata_sil 0000:00:09.0: Applying R_ERR on DMA activate FIS errata fix
> scsi7 : sata_sil
> scsi8 : sata_sil
> scsi9 : sata_sil
> scsi10 : sata_sil
> ata8: SATA max UDMA/100 cmd 0xf8942080 ctl 0xf894208a bmdma 0xf8942000 irq 20
> ata9: SATA max UDMA/100 cmd 0xf89420c0 ctl 0xf89420ca bmdma 0xf8942008 irq 20
> ata10: SATA max UDMA/100 cmd 0xf8942280 ctl 0xf894228a bmdma 0xf8942200 irq 20
> ata11: SATA max UDMA/100 cmd 0xf89422c0 ctl 0xf89422ca bmdma 0xf8942208 irq 20
> 
> Right now I am running 2.6.23.15-80.fc7, but I have also seen the errors under 2.6.23.1

I know this is probably not too helpful, but I had the same or similar
problems on a sata_nv based controller back around 2.6.20. I never
reported it, sadly... but I managed to get them to go away by disabling
ADMA on the controller.
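
For reference, a sketch of one way to do that, assuming a kernel whose sata_nv driver exposes the `adma` module parameter. The file name below is only an example, and the exact location of modprobe configuration differs between distributions:

```
# /etc/modprobe.d/sata_nv.conf  (example file name, an assumption)
# Ask sata_nv not to use ADMA mode on the controller.
options sata_nv adma=0
```

If sata_nv is built into the kernel or loaded from the initrd, the equivalent would be passing `sata_nv.adma=0` on the kernel command line (or rebuilding the initrd so it picks up the option). This applies to sata_nv only, so it is not a fix for the sata_sil setups discussed above.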

Probably not very helpful, 2 cents, and all :)

> 
>                                      Roger
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ide" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
