Re: Disk stuck in error recovery loop with AHCI

Jim Paris <jim@xxxxxxxx> · Fri, 23 Feb 2007 02:28:26 -0500

I wrote:
> I've been trying to track down data corruption I'm seeing on my
> server.

Turns out it was a bad disk.  Not a media error, but maybe bad RAM or
logic on the drive.

> I saw an error with AHCI that I hadn't seen before with the other
> controllers.
...
> Because the error at [11588.19xx] was repeated 30 times, I suspected
> NCQ.  I set the queue_depth on all 6 disks down to 1, and haven't seen
> the same problem since

It's not related to NCQ.  I still saw the problem with NCQ disabled.
It finally went away when I enabled spread-spectrum clocking in the
BIOS, and it stayed away after I turned NCQ back on.  So this report
is bogus.
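
For anyone wanting to repeat the NCQ test, here is a minimal sketch of
dropping the queue depth on a set of disks.  It assumes the usual sysfs
attribute /sys/block/<dev>/device/queue_depth and needs root.  The
function name and the sysfs_root parameter are made up for
illustration, and the device names in the usage line are placeholders.

```python
from pathlib import Path


def set_queue_depth(devices, depth=1, sysfs_root="/sys/block"):
    """Write `depth` to each device's queue_depth attribute.

    A depth of 1 means only one command is outstanding at a time,
    which effectively turns NCQ off for that disk.
    Returns the list of attribute paths that were written.
    """
    written = []
    for dev in devices:
        # e.g. /sys/block/sda/device/queue_depth
        attr = Path(sysfs_root) / dev / "device" / "queue_depth"
        attr.write_text(f"{depth}\n")
        written.append(attr)
    return written
```

Usage would look something like set_queue_depth(["sda", "sdb"], depth=1),
with the list extended to cover every disk under test.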

Still, it seems that libata's error handling (EH) could be improved
for this sort of situation.  For example, after "speed down requested
but no transfer mode left" appears a few times in a row, it might make
sense to just fail the disk and give up.  That would have allowed
higher layers like MD to recover.
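
To make the idea concrete, here is a rough sketch of the give-up policy
written as standalone Python rather than kernel code.  It is not how
libata's EH works today.  The class name, the "retry"/"fail_disk"
return values, and the threshold of 3 are all assumptions picked for
illustration.

```python
class SpeedDownTracker:
    """Count consecutive recovery attempts where no slower mode was left."""

    def __init__(self, max_failures=3):
        # Hypothetical threshold: how many back-to-back
        # "no transfer mode left" events to tolerate.
        self.max_failures = max_failures
        self.consecutive = 0

    def record(self, speed_down_possible):
        """Record one recovery attempt and return the suggested action."""
        if speed_down_possible:
            # A slower mode was still available, so the streak is broken.
            self.consecutive = 0
            return "retry"
        self.consecutive += 1
        if self.consecutive >= self.max_failures:
            # Stop looping; let higher layers (e.g. MD) handle the failure.
            return "fail_disk"
        return "retry"
```

With the default threshold, the third consecutive call to record(False)
returns "fail_disk" instead of retrying again.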

-jim