Re: faulty disk testing

Tejun Heo wrote:
> Hello, Mark.
>
> Mark Lord wrote:
>> Sure it does.  It can determine the number of consecutive failures on
>> the same drive/channel, and it can also count intervening successes, if any.
>>
>> From that, at a minimum, it could notice that the same drive has gone 'round
>> the error treadmill (say) 20 times in a row, with no other I/O possible on it
>> because it has yet to successfully complete the reset+reinit phase.
>
> If a device fails the reset+reinit phase a few times, libata surely drops the
> device, but I don't think the kernel can drop a device for failing, say, 20
> consecutive I/O commands while it still responds to reset and reinit.  That's
> where policy needs to come in, IMHO.
>
> For Ric's case, I'm waiting for more info.  If EH is looping forever without
> reporting to the upper layer, it definitely needs fixing, but I don't think
> that's the case.

Let me know what is useful here - I am working on a quick run with your
new-init git tree and will do a second run with ATA_DEBUG enabled.  I also
want to validate the disk interaction on the analyzer this morning.
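
For reference, the bookkeeping Mark is describing is cheap to express.
Something roughly like this - a user-space sketch only, the struct, helper
names and the 20-round threshold are all made up for illustration, this is
not libata code:

/*
 * Sketch of per-device "error treadmill" accounting: count EH rounds since
 * the last successful command and give up once a threshold is crossed.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_EH_ROUNDS	20	/* arbitrary policy knob */

struct dev_eh_stats {
	unsigned int rounds_since_success;
};

static void note_success(struct dev_eh_stats *s)
{
	s->rounds_since_success = 0;	/* any completed command resets the count */
}

/* Returns true when the device should be offlined rather than retried again. */
static bool note_failed_eh_round(struct dev_eh_stats *s)
{
	return ++s->rounds_since_success >= MAX_EH_ROUNDS;
}

int main(void)
{
	struct dev_eh_stats stats = { 0 };

	for (int round = 1; round <= 30; round++) {
		if (round == 3)
			note_success(&stats);	/* one good command resets the count */
		if (note_failed_eh_round(&stats)) {
			printf("round %d: give up, push the error upstairs\n", round);
			return 0;
		}
	}
	return 0;
}

The interesting part, as Tejun says below, is the policy: where the threshold
lives and who acts on it, not the counting itself.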


>> Such a drive is a candidate for pushing the error upstairs,
>> and possibly for getting offlined.
>>
>> Fancier fault handling is also possible, but the bare minimum is that we
>> must not get stuck looping forever in the EH code.  Eventually a failed
>> status has to be returned to the layers above, I think.
>
> Error is always pushed upstairs.  libata itself doesn't initiate any kind of
> retries; that's up to the high-level driver - in this case, sd.

If the error does pop out from SD, MD should (and has in the past) drop the drive from the array.

This could certainly be either the sd layer or the MD layer on top of it - it
seems that my hardware friend injected the faulty sector into the MD
superblock, so it might be a special path through MD.
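
Roughly the decision expected of the MD layer once sd finally gives up on a
request looks like the sketch below - an illustrative user-space sketch with
invented names, not the real md code path:

/*
 * When an I/O to a RAID member finally fails, mark that member faulty and
 * keep the array running degraded, unless it is the last working member,
 * in which case the error has to be passed up to the filesystem instead.
 */
#include <stdbool.h>
#include <stdio.h>

struct member {
	const char *name;
	bool faulty;
};

static int working_members(struct member *m, int n)
{
	int alive = 0;

	for (int i = 0; i < n; i++)
		if (!m[i].faulty)
			alive++;
	return alive;
}

/* Called when a request to member 'i' comes back with an unrecoverable error. */
static void member_error(struct member *m, int n, int i)
{
	if (m[i].faulty)
		return;				/* already kicked from the array */

	if (working_members(m, n) <= 1) {
		printf("%s: last working member, error goes up to the fs\n",
		       m[i].name);
		return;
	}

	m[i].faulty = true;
	printf("%s: marked faulty, array continues degraded\n", m[i].name);
}

int main(void)
{
	struct member raid1[2] = { { "sdb1" }, { "sdc1" } };

	member_error(raid1, 2, 1);	/* second member hits the injected bad sector */
	member_error(raid1, 2, 0);	/* if the first fails too, the error goes up */
	return 0;
}

Whether the error ever reaches that decision point is exactly the question
here, since the injected sector sits in the superblock rather than in data
that MD would read through the normal request path.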



> One of the problems is that currently libata EH can take several minutes to
> recover from an error condition.  With partial request retries from sd, a
> batch of consecutive bad sectors can make recovery take a really long time.
> This needs fixing.

So far, the new-init build has been running the recovery in the lab for about 40 minutes ;-)
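
To put rough numbers on Tejun's point - both figures below are guesses, not
measurements: if sd retries the remainder of a failed request and each bad
sector then triggers its own multi-second EH cycle, even a modest run of
consecutive bad sectors adds up to tens of minutes, which would at least be
consistent with a recovery of this length:

/* Back-of-envelope only: both numbers are assumptions, not measurements. */
#include <stdio.h>

int main(void)
{
	const double secs_per_eh_cycle = 30.0;	/* guessed reset + reinit + retry time */
	const int bad_sectors = 80;		/* consecutive bad sectors hit one at a time */

	printf("~%.0f minutes to grind through the bad span\n",
	       bad_sectors * secs_per_eh_cycle / 60.0);
	return 0;
}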

regards,

ric
