Re: RAID6, failed device, unresponsive system?

Mathias Burén <mathias.buren@xxxxxxxxx> · Tue, 17 Jan 2012 12:33:16 +0000

On 17 January 2012 12:02, Peter Grandi <pg@xxxxxxxxxxxxxxxxxxxx> wrote:
> [ ... ]
>
>>> Why is the system unresponsive, shouldn't it still be OK
>>> after a drive failure?
>
> There is a bit of a difference between a "drive failure" and
> some/several bad sectors on a drive.
>
> One also has to wonder whether the partially defective drive has
> been "failed" and "removed" from the MD set and perhaps
> "deleted" using '/sys/block/sdb/device/delete'.
>
>> Hm, I'm seeing this in dmesg, could it be related? (ioctl lock)
>
>> [425480.928740] md/raid:md0: read error corrected (8 sectors at 223617240 on sdb1)
>
> Note the "read error corrected" (*corrected*); that it is "8
> sectors" may indicate it is one of the drives with 4096B sectors
> that is configured as if it had 512B ones.
>
> [ ... ]
>
> Overall it is likely that you have just implicitly discovered
> how important short settings for Error Recovery Control are, and
> to choose drives that allow you to set them:
>
>  http://www.sabi.co.uk/blog/1103Mar.html#110331
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
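
The "8 sectors" remark above can be checked with a little arithmetic.
A minimal sketch, assuming the dmesg line is joined back onto one line
and that md counts sectors in 512-byte units; parse_read_error is a
hypothetical helper name, not anything from md or smartmontools. The
offset is relative to sdb1, so where the span lands on the raw disk
also depends on the partition start, which is not given in this
thread, and the arithmetic alone does not prove the drive's physical
sector size.

```python
import re

LOGICAL_SECTOR_BYTES = 512  # the kernel reports sector counts in 512-byte units


def parse_read_error(line):
    """Return (sector_count, start_sector, device) from an md
    'read error corrected' line. Hypothetical helper for illustration."""
    m = re.search(r"read error corrected \((\d+) sectors at (\d+) on (\w+)\)", line)
    if m is None:
        raise ValueError("not an md read-error line")
    return int(m.group(1)), int(m.group(2)), m.group(3)


# The dmesg line quoted above, joined onto a single line.
line = ("[425480.928740] md/raid:md0: read error corrected "
        "(8 sectors at 223617240 on sdb1)")

count, start, dev = parse_read_error(line)
span_bytes = count * LOGICAL_SECTOR_BYTES  # size of the corrected span in bytes
aligned = start % count == 0               # does the span start on a multiple of its own length?

print(dev, span_bytes, aligned)
```

So the corrected span is 4096 bytes long and starts on an 8-sector
boundary within sdb1, which is at least consistent with one 4096B
physical sector.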

$ sudo smartctl -l scterc,20,20 /dev/sdb
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-3.2.0-2-ARCH] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Warning: device does not support SCT Error Recovery Control command

:-/ And yes, it's a "4KB" drive, a WD20EARS. It failed after almost
11000 hours. Thanks, now I know the reason for the system hang.
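
For anyone scripting this check across several drives, here is a rough
sketch of the same test. It assumes smartctl's scterc values are given
in tenths of a second (so 20 means 2.0 seconds) and that the warning
text matches what is shown above; scterc_command and supports_scterc
are hypothetical helper names, and the sketch only builds the command
and inspects captured output rather than running smartctl itself.

```python
def scterc_command(device, read_ds, write_ds):
    """Build the smartctl argument list. read_ds and write_ds are the
    read and write recovery limits in tenths of a second."""
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]


def supports_scterc(smartctl_output):
    """Return False when the output carries the 'does not support' warning.
    A missing warning is only a hint, not proof that the setting was applied."""
    return "does not support SCT Error Recovery Control" not in smartctl_output


# The warning line printed for the WD20EARS above.
reported = "Warning: device does not support SCT Error Recovery Control command"

cmd = scterc_command("/dev/sdb", 20, 20)
limit_seconds = 20 / 10  # tenths of a second converted to seconds

print(" ".join(cmd))
print("requested limit:", limit_seconds, "seconds")
print("supported:", supports_scterc(reported))
```

With the output above, supports_scterc returns False, which matches
what the warning says about this drive.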

Regards,
Mathias