Re: Reduce Timeout on Disk Failure

Paul Clements <Paul.Clements@SteelEye.com> · Tue, 29 Apr 2003 10:06:14 -0400

jim@rubylane.com wrote:
> 
> If this is patched, I hope it is also put into a 2.2 update.  When a
> SW raid is running, a couple of I/O retries might be reasonable, but
> not heroic recovery attempts that would make good sense in a
> single-disk environment.

Yes, the md driver in 2.2 has a ridiculously large retry loop that runs
when an I/O failure occurs...if I counted correctly, it does 4096 retries
per failed request! This usually means that one of the lower-level
drivers ends up hung in a pretty tight error-handling loop...
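
To make the contrast concrete, here is a minimal userspace sketch of a
bounded retry loop. It is not the md driver's code: the names
io_attempt_fn, submit_with_retries, and MAX_RETRIES are hypothetical, and
the cap of 3 is an assumed value chosen only for illustration. The idea is
that with redundancy underneath, the caller gives up after a few attempts
and reports the failure upward, rather than spinning thousands of times.

```c
#include <stddef.h>

/* Hypothetical callback: returns 0 on success, nonzero on I/O error. */
typedef int (*io_attempt_fn)(void *ctx);

/* Assumed cap for illustration; the 2.2 loop described above is far larger. */
enum { MAX_RETRIES = 3 };

/*
 * Try the I/O once, then retry at most MAX_RETRIES times.
 * Stores the number of attempts made in *attempts (if non-NULL).
 * Returns 0 on success, -1 once the retry budget is exhausted, so the
 * caller (e.g. a RAID layer) can mark the device as failed and move on.
 */
static int submit_with_retries(io_attempt_fn attempt, void *ctx, int *attempts)
{
    int tries = 0;
    int rc = -1;

    while (tries <= MAX_RETRIES) {
        tries++;
        if (attempt(ctx) == 0) {
            rc = 0;
            break;
        }
    }
    if (attempts != NULL)
        *attempts = tries;
    return rc;
}

/* Demo device that always fails, like a powered-down disk. */
static int always_fail(void *ctx)
{
    (void)ctx;
    return 1;
}

/* Demo device that fails *ctx times, then succeeds. */
static int fail_then_succeed(void *ctx)
{
    int *failures_left = ctx;

    if (*failures_left > 0) {
        (*failures_left)--;
        return 1;
    }
    return 0;
}
```

With always_fail the function returns -1 after four attempts (one initial
try plus three retries); with fail_then_succeed and a countdown of 2 it
succeeds on the third attempt.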

> We did a simple test of powering down an IDE drive that was part of an
> (idle) SW raid, then trying to access the filesystem, and the system
> just locked up.  Maybe it would have eventually come back to life - I
> dunno.

Yep, we tried similar things with a network block device (breaking the
network connection)...we ended up hacking the raid1 and nbd drivers and
inserting schedule() calls just to mitigate the effects of the retries a
little bit...we at least got the system not to hang completely while the
retries were going on... 
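
The mitigation described above can be sketched in userspace as follows.
This is only an analogue, not the actual raid1 or nbd patch: POSIX
sched_yield() stands in for the kernel's schedule(), and io_try_fn,
retry_yielding, and count_and_fail are hypothetical names. The retry count
stays as large as the caller asks for; the only change is that the loop
gives up the CPU after each failed attempt so other work can make progress.

```c
#include <sched.h>

/* Hypothetical callback: returns 0 on success, nonzero on I/O error. */
typedef int (*io_try_fn)(void *ctx);

/*
 * Retry up to max_tries times, yielding the CPU after every failure.
 * Returns 0 on success, -1 if every attempt failed.
 */
static int retry_yielding(io_try_fn try_io, void *ctx, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        if (try_io(ctx) == 0)
            return 0;
        /* Userspace stand-in for the kernel's schedule(): let others run. */
        sched_yield();
    }
    return -1;
}

/* Demo device that always fails and counts how often it was called. */
static int count_and_fail(void *ctx)
{
    int *calls = ctx;

    (*calls)++;
    return 1;
}
```

Yielding does not shorten the retry sequence; it only keeps the rest of
the system responsive while the retries run, which matches the "not to
hang completely" outcome described above.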

> For the curious, we haven't upgraded to 2.4x because whenever I check
> the kernel traffic page, it seems there are still important bugs being
> found and corrected - ones we don't want to experience in a production
> setup.

Well, this particular retry problem does not exist in 2.4. And in
general, as far as software RAID is concerned, 2.4 is a lot better...I
know that, at least with raid1, you can fail a device just about anytime
you want (under heavy write activity, during a resync, etc.) and as often
as you want, and it doesn't hang...

--
Paul