Re: Raid 6 Fail Event

Chris Murphy <lists@xxxxxxxxxxxxxxxxx> · Mon, 17 Nov 2014 10:19:26 -0700

On Nov 16, 2014, at 6:34 PM, Justin Stephenson <justin@xxxxxxxxxxxxxxxxx> wrote:

> Thank you, Chris. I appreciate your help with this.
> 
> Backups are good. I'm a regular disk to disk to LTO guy. Here is what I have turned up:
> 
> ================================
> # smartctl -x /dev/sdh
> 
> big long list of stuff.

Please post it.

> I found the serial.
> 
> I also tried smartctl -H /dev/sdh and received
> 
> Overall-health self-assessment test result: PASSED
> 
> 184 End-to-End_Error {flag value worst thresh} Old_age FAILING_NOW_6

Cute, it’s failing but its overall health is passing. This is a great example of why the overall-health self-assessment is useless.

> 
> I did not find anything for the serial in results from dmesg
> 
> # smartctl -l scterc /dev/sdh
> 
> Warning: device does not support SCT Commands

Interesting that it supports a SMART IV attribute but doesn’t support SCT commands.

> 
> # cat /sys/block/sdh/device/state
> 
> Running
> 
> # cat /sys/block/sdh/device/timeout
> 
> 30

Since the drives you have don’t support SCT commands, you need to set the SCSI command timer to something much longer than the default of 30 seconds; otherwise your array will not function correctly when it encounters bad sectors. In many cases the Linux SCSI command timer reaches 30 seconds and resets the interface before the typical consumer drive finishes its own error recovery (that is, before it either returns the data successfully or reports a read error). That recovery can take quite long, maybe 2 minutes. Future drives you buy should have configurable SCT ERC, so the drive can be set to return a read error after something like 7 seconds. You want the drive to give up sooner: once md is informed of the problem sector range, the data is rebuilt from parity and written back to the bad sectors, which fixes the problem on the drive.
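As a rough sketch of raising the command timer: the function below writes a new value to the same sysfs file you read above. The 180 second figure is my assumption (comfortably above the "maybe 2 minutes" worst case), and set_cmd_timeout and SYSFS_BLOCK are names made up for this example, not part of any tool.

```shell
#!/bin/sh
# Root of the block-device sysfs tree; overridable so the function can be
# tried against a scratch directory first. (SYSFS_BLOCK is a made-up name.)
SYSFS_BLOCK="${SYSFS_BLOCK:-/sys/block}"

# set_cmd_timeout SECONDS DEVICE...
# Write SECONDS to the SCSI command timer of each named device (e.g. sdh).
# Returns non-zero if any device's timeout file could not be written.
set_cmd_timeout() {
    secs="$1"; shift
    rc=0
    for dev in "$@"; do
        f="$SYSFS_BLOCK/$dev/device/timeout"
        if [ -w "$f" ]; then
            echo "$secs" > "$f"
            echo "$dev: command timer now $(cat "$f") s"
        else
            echo "$dev: cannot write $f (not root, or no such device?)" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Example, as root, for the drive discussed here (180 is an assumed value):
#   set_cmd_timeout 180 sdh
```

The setting does not survive a reboot, so it would need to be reapplied at boot for every array member. On drives that do support SCT ERC, the equivalent knob is smartctl -l scterc,70,70 /dev/sdX (units are tenths of a second, so 70 means 7 seconds), and the command timer can stay at its default.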

> 
> ================================
> 
> Should I replace the drive or re add and resync?

Well, I don’t know anything about attribute 184 End-to-End_Error, but based on the description on Wikipedia it sounds disqualifying to me.

I personally would get the drive replaced no matter what: either under warranty or, if there’s no warranty, I’d get a new drive and test/play with this one offline. If it proves its worth, then maybe it can be a spare down the road.

But you could also run smartctl -x on all the other drives and see what value they have for this attribute.
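Something along these lines would do the comparison across drives. It is only a sketch: it uses smartctl -A (the attribute table alone) rather than the full -x output, it assumes the usual ten-column attribute layout, and attr184 and check_all_drives are names invented for this example.

```shell
#!/bin/sh
# Read `smartctl -A` output on stdin and print the name, VALUE, WORST,
# THRESH and WHEN_FAILED columns of attribute 184, if the drive reports it.
attr184() {
    awk '$1 == 184 { print $2, "value=" $4, "worst=" $5, "thresh=" $6, "when_failed=" $9 }'
}

# check_all_drives DEVICE...
# Run smartctl -A on each device and summarize attribute 184 per drive.
check_all_drives() {
    for d in "$@"; do
        r=$(smartctl -A "$d" | attr184)
        echo "$d: ${r:-attribute 184 not reported}"
    done
}

# Example, as root (the device names are placeholders for your members):
#   check_all_drives /dev/sd[a-h]
```

A drive whose value sits well above its threshold and shows "-" under when_failed is fine on this attribute; one that matches sdh would be a second candidate for replacement.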

> 
> I also went through and reseated all the SATA and power connections as I understand these can cause issues as well.

Chris Murphy