Re: RAID6, failed device, unresponsive system?

Mathias Burén <mathias.buren@xxxxxxxxx> · Wed, 18 Jan 2012 09:22:31 +0000

On 18 January 2012 06:54, Stefan /*St0fF*/ Hübner
<stefan.huebner@xxxxxxxxxxxxxxxxxx> wrote:
> Hi
>
> On 17.01.2012 13:33, Mathias Burén wrote:
>> On 17 January 2012 12:02, Peter Grandi <pg@xxxxxxxxxxxxxxxxxxxx> wrote:
>>> [ ... ]
>>>
>>>>> Why is the system unresponsive, shouldn't it still be OK
>>>>> after a drive failure?
>>>
>>> There is a bit of a difference between a "drive failure" and
>>> some/several bad sectors on a drive.
>>>
>>> One also has to wonder whether the partially defective drive has
>>> been "failed" and "removed" from the MD set and perhaps
>>> "deleted" using '/sys/block/sdb/device/delete'.
>>>
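
For reference, the "failed", "removed", "deleted" sequence described above could look roughly like the sketch below. md0 and sdb1 come from the dmesg line quoted in this thread; the run() wrapper and the DRY_RUN switch are my own illustrative additions, and by default the script only prints what it would do.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set explicitly.
# md0 and sdb1 are taken from the dmesg line in this thread; adjust them.
ARRAY=${ARRAY:-/dev/md0}
PART=${PART:-/dev/sdb1}
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: echo the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

retire_member() {
    # Strip the partition number (sdb1 -> sdb); only valid for sdX-style names.
    disk=$(basename "$PART" | sed 's/[0-9]*$//')
    # Mark the member faulty, then take it out of the MD set.
    run mdadm "$ARRAY" --fail "$PART"
    run mdadm "$ARRAY" --remove "$PART"
    # Ask the SCSI layer to drop the device so nothing keeps retrying on it.
    run sh -c "echo 1 > /sys/block/$disk/device/delete"
}

retire_member
```

Running it with DRY_RUN=0 as root would actually execute the three steps, so check the names first.
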
>>>> Hm, I'm seeing this in dmesg, could it be related? (ioctl lock)
>>>
>>>> [425480.928740] md/raid:md0: read error corrected (8 sectors at
>>>> 223617240 on sdb1)
>>>
>>> Note the "read error corrected" (*corrected*); that it is "8
>>> sectors" may indicate it is one of the drives with 4096B sectors
>>> that is configured as if it has 512B ones.
>>>
>
> Right, that is how the WD20EARS reacts.
>
>>> [ ... ]
>>>
>>> Overall it is likely that you have just implicitly discovered
>>> how important short settings for Error Recovery Control are, and
>>> to choose drives that allow you to set them:
>>>
>>>  http://www.sabi.co.uk/blog/1103Mar.html#110331
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>  $ sudo smartctl -l scterc,20,20 /dev/sdb
>> smartctl 5.42 2011-10-20 r3458 [x86_64-linux-3.2.0-2-ARCH] (local build)
>> Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
>>
>> Warning: device does not support SCT Error Recovery Control command
>>
>> :-/ and yes it's a "4KB" drive, a WD20EARS. It failed after almost
>> 11000 hours. Thanks, now I know the reason for the system hang.
>
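
A note on the smartctl call above: the scterc values are in tenths of a second, so 20,20 asks for a 2.0 s limit on reads and writes. When the drive refuses the command, as it did here, one commonly suggested fallback is to raise the kernel's per-device command timeout (30 s by default) well above the drive's internal retry time. A minimal sketch, assuming sdb and treating the 70 and 180 values as illustrative choices rather than anything established in this thread:

```shell
#!/bin/sh
# Sketch only: prints its plan unless DRY_RUN=0 is set explicitly.
# DEV, ERC_DS and FALLBACK_S are illustrative defaults, not tested values.
DEV=${DEV:-sdb}
ERC_DS=${ERC_DS:-70}            # SCT ERC limit in deciseconds (70 = 7.0 s)
FALLBACK_S=${FALLBACK_S:-180}   # kernel command timeout in seconds
DRY_RUN=${DRY_RUN:-1}

# Convert the deciseconds smartctl expects into seconds for display.
ds_to_seconds() {
    echo "$(( $1 / 10 )).$(( $1 % 10 ))"
}

set_erc_or_timeout() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ smartctl -l scterc,$ERC_DS,$ERC_DS /dev/$DEV ($(ds_to_seconds "$ERC_DS") s)"
        echo "+ if unsupported: echo $FALLBACK_S > /sys/block/$DEV/device/timeout"
        return 0
    fi
    # Try to set ERC; fall back to the kernel timeout if the drive refuses.
    if smartctl -l scterc,"$ERC_DS","$ERC_DS" "/dev/$DEV" 2>&1 |
            grep -q 'does not support SCT Error Recovery Control'; then
        echo "$FALLBACK_S" > "/sys/block/$DEV/device/timeout"
    fi
}

set_erc_or_timeout
```

Neither setting survives a power cycle, so something like this would have to run at every boot.
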
> Those are not well suited to RAID.  They were the cheapest WD 2TB
> drives in the consumer segment, and they don't support TLER/ERC.  In
> my experience the replacement drives won't last very long, either.  At
> least you have RAID6 there...
>

I know. I picked them because they were cheap, I got 5 of them new for
about 380 USD.

> Is the drive that corrects its sectors still in the array (I'd guess
> so)?  If yes, just file the next RMA and give the fault as "drive
> reacts very slowly".  I fear you will have to either wait for the first
> resync or run ddrescue on the disk that is still in the array while
> the array is offline (that way you don't risk another drive failing
> during the resync).
>
> All the best,
> Stefan

(cc Linux RAID)
The drive is now out of the array. I had to pull the power to the
system, physically pull the disk, then boot into single-user mode.
There I had to force-assemble the array (it would not assemble
automatically in an unclean state). That worked fine, so I ran an
fsck, which turned out OK, and now the array is "checking". I
probably should have checked the array before the fsck, but oh well.
The check is still in progress.
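
For the archives, here is roughly that sequence in the order I should have used (array check before fsck). Treat it as a sketch: the member list is made up, I am assuming the filesystem sits directly on md0, and the run() and wait_idle() helpers are only there for illustration. It prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set explicitly.
# MEMBERS is a made-up list; md0 is the array name from the dmesg output.
ARRAY=${ARRAY:-/dev/md0}
MEMBERS=${MEMBERS:-"/dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1"}
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: echo the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# Poll sync_action until the running check has finished (skipped in dry-run).
wait_idle() {
    [ "$DRY_RUN" = 1 ] && return 0
    while [ "$(cat "/sys/block/$1/md/sync_action")" != idle ]; do
        sleep 60
    done
}

recover_array() {
    md=$(basename "$ARRAY")
    # 1. Force assembly from the surviving members (the array was not clean).
    #    MEMBERS is left unquoted on purpose so it splits into separate names.
    run mdadm --assemble --force "$ARRAY" $MEMBERS
    # 2. Start a check of the array; progress shows up in /proc/mdstat.
    run sh -c "echo check > /sys/block/$md/md/sync_action"
    wait_idle "$md"
    # 3. Filesystem check without repairs first (-n is honoured by e2fsck).
    run fsck -n "$ARRAY"
}

recover_array
```

After the check, /sys/block/md0/md/mismatch_cnt shows whether any inconsistencies were found.
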

Thanks,
M