The subject of this thread has been adjusted to reflect my revised understanding of the problem. I had previously thought the problem with a RAID1 re-sync generating timeout errors on the active (reading) disk appeared with 2.6.19, but it turns out the problem was also present in 2.6.18. I was running the 2.6.18 tests with a modular kernel and hadn't loaded the ahci module in that scenario (so there was no queueing), but was using a kernel with SATA built in for the 2.6.19 test (which enabled queueing). When ahci is loaded with 2.6.18, the same timeout problem appears.

The distinguishing factor appears to be the queue depth (4 works; 5 and various values up to and including 31 fail), not the kernel version. I am going to try running with the queue depth clamped at 4 to see whether this consistently masks the problem. If I have the time, I may also try some more experiments, like instrumenting which command was issued right before the group of commands that all time out, or increasing the SCSI timeout, to get more insight into what is going on at the time of the failure. I'd be happy to run any tests people with expertise in this area can suggest to help better understand what is happening.

My guess is that the disk becomes wedged somehow at this point. For what it's worth, it is usually possible to regain control of the system if I stop the RAID re-sync after the first timeout is reported, but when I let it try to run to completion, the SATA subsystem usually ends up stuck issuing reset after reset with no success, and a power-cycle is the only recourse.
--
Mike Accetta
ECI Telecom Ltd.
Data Networking Division (previously Laurel Networks)
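
P.S. For anyone wanting to try the same work-around, the knobs I have in mind are the per-device sysfs attributes, along these lines (sda is just an example for the resync source disk; I haven't yet confirmed that ahci honors a runtime queue_depth change on these kernels):

  # clamp the NCQ depth to 4 on the reading disk
  echo 4 > /sys/block/sda/device/queue_depth
  # raise the SCSI command timeout from the default 30s to 60s
  echo 60 > /sys/block/sda/device/timeout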