Tim Small wrote:
Tejun Heo wrote:
The only constants seem to be libata and ICH7/8.
We must have a bug somewhere in there.
In piix mode or ahci mode? If in piix mode, ich7 and 8 would behave
quite differently. ICH8 has SIDPR so it can hardreset while 7 can't.
ICH SIDPR access had a hardware problem where a write to SControl to
clear DET is sometimes ignored, which led to occasional hardreset
failures; that got fixed recently. The reason why ICHs are involved
in those incidents could just be that they are extremely popular.
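
To illustrate the "ignored write" failure mode described above, here is
a minimal sketch of a write-then-read-back retry loop. It is not the
actual ata_piix fix; FlakySControl is a made-up stand-in for the
register, and the retry count is arbitrary. The only hard fact assumed
is that DET lives in bits 3:0 of SControl.

```python
DET_MASK = 0xF  # SControl bits 3:0 hold the DET field


class FlakySControl:
    """Hypothetical register model that silently drops the first N writes."""

    def __init__(self, value, ignore_first_n):
        self.value = value
        self._ignore = ignore_first_n

    def read(self):
        return self.value

    def write(self, value):
        if self._ignore > 0:
            self._ignore -= 1  # simulate the hardware ignoring this write
            return
        self.value = value


def clear_det(reg, max_tries=5):
    """Clear DET, verifying by read-back; return how many writes it took."""
    for attempt in range(1, max_tries + 1):
        reg.write(reg.read() & ~DET_MASK)
        if (reg.read() & DET_MASK) == 0:
            return attempt
    raise TimeoutError("DET still set after %d writes" % max_tries)
```

With a simulated register holding 0x301 (DET=1) that drops two writes,
clear_det() needs three attempts and leaves 0x300 behind.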
It's a non-AHCI capable ICH7, so it's in piix mode.
Things to try after such a complete drive shutdown are...
Unfortunately I can't do much with this box, as it's a rented box in a
datacentre, however....
* Soft reset the machine. Can BIOS recognize the drive?
Yes: if I run 'echo b > /proc/sysrq-trigger', the BIOS recognises the
drive and the box reboots normally.
In many cases I've seen, it's usually that the drive's firmware is
completely hung and only power cycling the drive brought it back. But
then again, there have been some number of cases which didn't get
diagnosed properly, so it's definitely possible that we're doing
something wrong in the driver.
Anyways, if it happens again, please try the above and try to find out
whether the controller or the drive is hung. Also, please keep in
mind that timeouts on 0xEA (flush) are very often indicative of power
problems.
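
For anyone matching opcodes in a libata error dump against the commands
discussed here, a small lookup sketch follows. The opcode values are the
standard ATA command set ones; the table only covers commands that come
up in this thread, and describe_opcode() is just an illustrative helper
name.

```python
# Standard ATA opcodes, limited to the commands mentioned in this thread.
ATA_OPCODES = {
    0xE7: "FLUSH CACHE",
    0xEA: "FLUSH CACHE EXT",
    0xEC: "IDENTIFY DEVICE",
    0xB0: "SMART",
}


def describe_opcode(opcode):
    """Return a printable label for an opcode, or mark it as not in the table."""
    name = ATA_OPCODES.get(opcode, "not in this table")
    return "0x%02X (%s)" % (opcode, name)
```

So the 0xEA referred to above is FLUSH CACHE EXT, the 48-bit variant of
the flush command.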
OK, I didn't think I was seeing those - is it possible to tell from the
detail which I posted in my original message? As for the potential for
PSU shenanigans - I don't have access to the box to fiddle with that,
unfortunately, but I believe I can stress the I/O subsystem quite
heavily with dd and/or bonnie, but it's only when polling for SMART
status that these errors show up. I've just started dd (to RAID mirror)
+ hdparm -I again to check...
Do the SMART error counters in the OP make this suspicious? Is there
likely to be any difference between running smartctl -a and hdparm -I in
terms of code path taken through the kernel, or timings on the hardware,
as far as you know?
..
My theory on the problem when I first had it here, was that doing
a FLUSH_CACHE[_EXT] before any PIO command (eg. SMART) should prevent
the problem. This was never explored further (by me or others).
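
A rough sketch of what that theory amounts to, purely as an ordering
rule over a command list. Nothing here talks to real hardware, the
function name is made up, and whether a given command counts as PIO is
left to the caller as an assumption.

```python
FLUSH_CACHE = 0xE7
FLUSH_CACHE_EXT = 0xEA


def with_flush_before_pio(commands, use_ext=True):
    """Insert a flush ahead of every PIO command.

    commands is a list of (opcode, is_pio) tuples supplied by the caller;
    the return value is the list of opcodes in the order they would be issued.
    """
    flush = FLUSH_CACHE_EXT if use_ext else FLUSH_CACHE
    ordered = []
    for opcode, is_pio in commands:
        if is_pio:
            ordered.append(flush)  # flush first, then the PIO command
        ordered.append(opcode)
    return ordered
```

For example, a SMART command (0xB0, PIO) followed by a DMA read (0x25)
would come out as 0xEA, 0xB0, 0x25.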
Cheers