unreadable drives can be synchronized?

"Colin McCabe" <colin.p.mccabe@xxxxxxxxx> · Wed, 16 May 2007 11:50:15 -0400

Hi all,

I am running software RAID on Linux 2.6.21.

While experimenting with adding and removing devices from the RAID array, I
noticed something very troubling. I have a bad drive (let's call it drive B)
which gets random read errors. I also have a good drive, call it drive A.

B can synchronize with A. But if I then remove A from the RAID array, A
cannot be re-added: rebuilding A means reading from the bad drive, B, and
those reads fail.

Basically, B appears to be "write-only"; it will never return an error on a
write, but just try to read from it, and you will be sorry.

Writing is fine:
[root@cmccabe-devel root]# dd if=/dev/zero of=/dev/sdb bs=524288
dd: writing `/dev/sdb': No space left on device
114464+0 records in
114463+0 records out

Reading is not:
[root@cmccabe-devel root]# dd if=/dev/sdb of=/dev/null bs=524288
ata1.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x2 frozen
ata1.00: cmd 60/00:00:00:b0:01/01:00:00:00:00/40 tag 0 cdb 0x0 data 131072 in
[ ... copious errors ... ]
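
The same read test can be sketched in user space so that it records which
offsets fail instead of just flooding the log. This is only an illustration:
the name scan_reads is made up for this example, and the 524288-byte default
block size simply mirrors the dd commands above.

```python
import os

def scan_reads(path, block_size=524288):
    """Read `path` block by block; return (bytes_read, failed_offsets).

    Hypothetical helper for illustration only. A failed read is recorded
    and the scan moves on to the next block.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size of a regular file or a
        # Linux block device.
        size = os.lseek(fd, 0, os.SEEK_END)
        bytes_read = 0
        failed_offsets = []
        for offset in range(0, size, block_size):
            try:
                chunk = os.pread(fd, block_size, offset)
                bytes_read += len(chunk)
            except OSError:
                failed_offsets.append(offset)
        return bytes_read, failed_offsets
    finally:
        os.close(fd)
```

Called as scan_reads("/dev/sdb") by root, it would return the number of bytes
read and the list of block offsets whose reads raised an error.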

I have disabled write caching using hdparm -W0.
Both drives are: Fujitsu MHV2060BH, 60 GB, Serial ATA
The SATA controller is: ICH6

My problem is that even though B reaches the synchronized state, it cannot
be read back reliably, so it is no good at all. This is potentially
misleading: if someone removes A after synchronizing B, the system will
probably crash, since there will be no good drives left.

I wonder if anyone else is interested in a "paranoid recovery" mode where the
md layer reads back and verifies the data it has written during recovery.
Even if this doubles the recovery time, I think that it would be desirable
for many applications.
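
To make the idea concrete, here is a minimal user-space sketch of
write-then-verify. It is not md code and does not claim to show how md would
implement such a mode; write_and_verify is a hypothetical name. It also
assumes the read-back actually reaches the disk. On a real block device the
reads could be served from the page cache, so a meaningful check would need
to bypass it (for example with O_DIRECT), which this sketch does not do.

```python
import os

def write_and_verify(path, data, block_size=524288):
    """Write `data` to `path`, flush, read it back; return bad offsets.

    Hypothetical illustration only. ASSUMPTION: reads are not satisfied
    from the page cache; this sketch makes no attempt to bypass it.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        # Pass 1: write every block, retrying on short writes.
        for offset in range(0, len(data), block_size):
            chunk = data[offset:offset + block_size]
            pos = offset
            while chunk:
                written = os.pwrite(fd, chunk, pos)
                chunk = chunk[written:]
                pos += written
        os.fsync(fd)  # push the written data out before verifying

        # Pass 2: read each block back and compare with what was written.
        bad_offsets = []
        for offset in range(0, len(data), block_size):
            expected = data[offset:offset + block_size]
            try:
                actual = os.pread(fd, len(expected), offset)
            except OSError:
                actual = None  # a read error counts as a failed block
            if actual != expected:
                bad_offsets.append(offset)
        return bad_offsets
    finally:
        os.close(fd)
```

An empty result means every block read back as written. The second pass is
what roughly doubles the I/O, which is the cost mentioned above.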

Colin