Re: faulty disk testing

Hello, Ric.

[cc'ing Jeff]

Ric Wheeler wrote:
> Hi Tejun,
>
> We have been trying to inject some errors on some drives & validate that the new error handling kicks out drives.
>
> Using 2.6.18rc3 on a box with 4 drives - 3 good & one with an artificially created ecc error in the 4-way MD RAID1 partition.
>
> The error handling worked through the various transitions, but did not give up on the drive well enough to let the boot continue using the other 3.

I suppose the introduced errors are transient and some sectors complete IO successfully between errors, right? As long as the drive responds to the recovery actions (provides its signature on reset, returns ID data on IDENTIFY, and responds to SETFEATURES), libata assumes the error condition is transient and lets the drive continue operating.

So, no, libata won't drop a drive unless it fails to respond to the recovery sequence. libata just doesn't have enough information about how devices are used to determine whether a device is failing too often to be useful. For example, there is a very big difference between a hard drive serving rootfs by itself and a drive that is in an md array with several spares.
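
To make the rule above concrete, here is a minimal sketch of the decision as I described it. This is a toy model, not libata code: RecoveryResult and keep_device are names made up for illustration, and the three flags just stand in for the three recovery checks mentioned above.

```python
from dataclasses import dataclass


@dataclass
class RecoveryResult:
    """Outcome of one recovery attempt (hypothetical model, not a libata type)."""
    signature_on_reset: bool  # device presented its signature after reset
    identify_ok: bool         # IDENTIFY returned ID data
    setfeatures_ok: bool      # SETFEATURES completed


def keep_device(result: RecoveryResult) -> bool:
    """Keep the device as long as it answered every recovery step.

    Note what is missing: how often the device fails, or how it is used,
    plays no part in the decision.
    """
    return (result.signature_on_reset
            and result.identify_ok
            and result.setfeatures_ok)
```

A drive that errors out on every other read but still answers all three steps is kept under this rule, which matches the behaviour you are seeing.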

> I plan to look at the state of the drive with an analyzer tomorrow to make sure that the drive is not holding the bus or something & try your latest "new init" git tree code.

New init stuff won't change anything regarding this.

> What it looks like is a soft hang - maybe the box is stuck in ata_port_wait_eh() which never seems to timeout on a bus that does not recover?

It seems like we need a separate mechanism here to implement policy for longer-term handling of frequently failing devices. Providing some monitoring sysfs nodes should probably do it: some error history with a record of recovery times, so that a user-space management process can decide to pull the plug if that seems appropriate.
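
As a rough sketch of what such a user-space policy might look like, assuming the kernel exported one record per error with a timestamp and the time recovery took: no such sysfs nodes exist today, so ErrorEvent, DrivePolicy and all the thresholds below are purely hypothetical, and events are assumed to arrive in timestamp order.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ErrorEvent:
    """One error record (hypothetical; no such kernel interface exists yet)."""
    timestamp: float      # seconds, when the error was seen
    recovery_secs: float  # how long recovery took


class DrivePolicy:
    """Sliding-window policy deciding when a drive fails too often."""

    def __init__(self, window_secs=600.0, max_errors=5, max_recovery_secs=60.0):
        self.window_secs = window_secs
        self.max_errors = max_errors
        self.max_recovery_secs = max_recovery_secs
        self.events = deque()

    def record(self, event):
        """Add an event and drop those that fell out of the window."""
        self.events.append(event)
        while event.timestamp - self.events[0].timestamp > self.window_secs:
            self.events.popleft()

    def should_fail(self):
        """True if the window holds too many errors or too much recovery time."""
        total_recovery = sum(e.recovery_secs for e in self.events)
        return (len(self.events) > self.max_errors
                or total_recovery > self.max_recovery_secs)
```

The thresholds are the part libata cannot know: a box with a single rootfs drive would set them very high, while an md array with spares could use low ones and, once should_fail() returns true, mark the member faulty.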

Thanks.

--
tejun