Re: raid/device failure

Phil Turmel <philip@xxxxxxxxxx> · Sun, 10 Feb 2013 21:09:51 -0500

On 02/10/2013 08:27 PM, Thomas Fjellstrom wrote:
> I've re-configured my NAS box (still haven't put it into "production") to be a 
> raid5 over seven 2TB consumer Seagate Barracuda drives, and with some tweaking, 
> performance was looking stellar.
> 
> Unfortunately I started seeing some messages in dmesg that worried me:

[trim /]

The MD subsystem keeps a count of read errors on each device, corrected
or not, and kicks the drive out when the count reaches twenty (20).
Every hour, the accumulated count is cut in half to allow for general
URE "maintenance" in regular scrubs.  This behavior and the count are
hardcoded in the kernel source.
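
As a rough illustration of that accounting, here is a toy model in
Python.  It is not the kernel code: the class and constant names are
made up, and it assumes the count is halved once per full elapsed hour
and that the drive is kicked as soon as the count reaches 20, as
described above.

```python
# Toy model of the per-device read-error accounting described above.
# Not kernel code; all names here are hypothetical.

MAX_READ_ERRORS = 20        # threshold quoted above
DECAY_INTERVAL_S = 3600     # the count is halved once per hour


class ReadErrorCounter:
    def __init__(self):
        self.count = 0
        self.last_decay = 0.0   # time (seconds) up to which decay is applied

    def _decay(self, now):
        # Halve the count once for every full hour that has passed.
        hours = int((now - self.last_decay) // DECAY_INTERVAL_S)
        if hours > 0:
            self.count >>= hours
            self.last_decay += hours * DECAY_INTERVAL_S

    def record_error(self, now):
        """Record one read error at time `now` (seconds).

        Returns True if the device would be kicked out of the array.
        """
        self._decay(now)
        self.count += 1
        return self.count >= MAX_READ_ERRORS
```

In this model, a burst of 20 errors inside one hour kicks the drive,
while 19 errors followed by a quiet hour decay to 9 before the next
error is counted.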

> I've run full S.M.A.R.T. tests (except the conveyance test, probably run that 
> tonight and see what happens) on all drives in the array, and there are no 
> obvious warnings or errors in the S.M.A.R.T. results at all. Including 
> reallocated (pending or not) sectors.

MD fixed most of these errors, so I wouldn't expect to see them in SMART
unless the fix triggered a reallocation.  But some weren't corrected, so
I would be concerned that MD and SMART don't agree.

Have these drives ever been scrubbed?  (I vaguely recall you mentioning
new drives...)  If they are new and already had a URE, I'd be concerned
about mishandling during shipping.  If they aren't new, I'd
destructively exercise them and retest.
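
For reference, a scrub can be requested through the md sysfs
interface by writing "check" to the array's sync_action file.  A
minimal sketch in Python, assuming the array is md0 (a placeholder,
substitute your own) and that it runs as root; the helper names are
made up for illustration:

```python
from pathlib import Path


def start_scrub(md_name="md0", sysfs_root="/sys/block"):
    """Ask MD to start a read-only 'check' pass on the array.

    md_name is an assumed array name; adjust for your system.
    """
    action = Path(sysfs_root) / md_name / "md" / "sync_action"
    action.write_text("check\n")
    return action


def mismatch_count(md_name="md0", sysfs_root="/sys/block"):
    """Read the mismatch counter MD reports for the array."""
    path = Path(sysfs_root) / md_name / "md" / "mismatch_cnt"
    return int(path.read_text())
```

Progress shows up in /proc/mdstat while the check runs, and any read
errors MD hits along the way are logged in dmesg.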

> I've seen references while searching for possible causes, where people had 
> this error occur with faulty cables, or SAS backplanes. Is this a likely 
> scenario? The cables are brand new, but anything is possible.
> 
> The card is an IBM M1015 8 port HBA flashed with the LSI 9211-8i IT firmware, 
> and no BIOS.

It might not hurt to recheck your power supply rating vs. load.  If you
can't find anything else, a data-logging voltmeter with min/max capture
would be my tool of choice.

http://www.fluke.com/fluke/usen/digital-multimeters/fluke-287.htm?PID=56058

Phil