Disk stuck in error recovery loop with AHCI

Jim Paris <jim@xxxxxxxx> · Wed, 21 Feb 2007 00:20:22 -0500

Hi,

For the past month or so I've been trying to track down data
corruption I'm seeing on my server.  It's something like 2-3 bits per
100 gigs of data, but only shows up on my 4-disk RAID5.  RAM is ECC
and fine.  Anyway, as part of my testing I swapped SATA controllers
many times (sata_nv, sata_sil24, sata_sil) then eventually switched
the motherboard/cpu/ram entirely and am now using an Intel chipset
with AHCI.  Turns out the corruption is still there (my current theory
is a bad disk or cables, doing more testing at the moment), but I saw
an error with AHCI that I hadn't seen before with the other
controllers.

Kernel is 2.6.20.  The onboard 6-port controller is:

00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA Storage Controller AHCI (rev 09)
00:1f.2 0106: 8086:2681 (rev 09)

It is set up in "Enhanced AHCI" mode in BIOS, because that's the only
mode in which all the disks are recognized.

After setting up the new machine, I was doing a MD "repair" operation on
the first two disks (RAID1) and the last four disks (RAID5):
   echo repair > /sys/block/md1/md/sync_action
   echo repair > /sys/block/md2/md/sync_action

Eventually I got the following errors on one of the disks in the raid1: 
(full dmesg is attached)

...
Feb 20 01:05:07 psychosis kernel: [11473.137130]          res 40/00:08:10:d7:4e/00:00:07:00:00/40 Emask 0x10 (ATA bus error)
Feb 20 01:05:07 psychosis kernel: [11473.449587] ata3: soft resetting port
Feb 20 01:05:07 psychosis kernel: [11473.952880] ata3: softreset failed, retrying in 5 secs
Feb 20 01:05:12 psychosis kernel: [11478.947007] ata3: hard resetting port
Feb 20 01:05:20 psychosis kernel: [11486.256966] ata3: port is slow to respond, please be patient (Status 0x80)
Feb 20 01:05:43 psychosis kernel: [11509.228781] ata3: hardreset failed, retrying in 5 secs
Feb 20 01:05:48 psychosis kernel: [11514.223105] ata3: hard resetting port
Feb 20 01:05:48 psychosis kernel: [11514.542508] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 20 01:05:48 psychosis kernel: [11514.544548] ata3.00: configured for UDMA/133
Feb 20 01:05:48 psychosis kernel: [11514.544559] ata3: EH complete
Feb 20 01:05:48 psychosis kernel: [11514.544685] SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB)
Feb 20 01:05:48 psychosis kernel: [11514.544709] sda: Write Protect is off
Feb 20 01:05:48 psychosis kernel: [11514.544743] SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 20 01:05:57 psychosis kernel: [11523.816196]          res 40/00:00:10:7d:5f/00:00:07:00:00/40 Emask 0x10 (ATA bus error)
Feb 20 01:05:58 psychosis kernel: [11524.128039] ata3: soft resetting port
Feb 20 01:05:58 psychosis kernel: [11524.631101] ata3: softreset failed, retrying in 5 secs
Feb 20 01:06:03 psychosis kernel: [11529.625457] ata3: hard resetting port
Feb 20 01:06:10 psychosis kernel: [11536.935417] ata3: port is slow to respond, please be patient (Status 0x80)
Feb 20 01:06:33 psychosis kernel: [11559.907230] ata3: hardreset failed, retrying in 5 secs
Feb 20 01:06:39 psychosis kernel: [11564.901852] ata3: hard resetting port
Feb 20 01:06:39 psychosis kernel: [11565.221011] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 20 01:06:39 psychosis kernel: [11565.223107] ata3.00: configured for UDMA/133
Feb 20 01:06:39 psychosis kernel: [11565.223118] ata3: EH complete
Feb 20 01:06:39 psychosis kernel: [11565.223245] SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB)
Feb 20 01:06:39 psychosis kernel: [11565.223268] sda: Write Protect is off
Feb 20 01:06:39 psychosis kernel: [11565.223301] SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 20 01:07:02 psychosis kernel: [11588.192660]          res 40/00:50:10:76:88/00:00:07:00:00/40 Emask 0x10 (ATA bus error)
Feb 20 01:07:02 psychosis kernel: [11588.192788]          res 40/00:50:10:76:88/00:00:07:00:00/40 Emask 0x10 (ATA bus error)
      (repeated 30 times total)
Feb 20 01:07:02 psychosis kernel: [11588.194209]          res 40/00:50:10:76:88/00:00:07:00:00/40 Emask 0x10 (ATA bus error)
Feb 20 01:07:02 psychosis kernel: [11588.504333] ata3: soft resetting port
Feb 20 01:07:03 psychosis kernel: [11589.007184] ata3: softreset failed, retrying in 5 secs
Feb 20 01:07:08 psychosis kernel: [11594.001540] ata3: hard resetting port
Feb 20 01:07:15 psychosis kernel: [11601.311499] ata3: port is slow to respond, please be patient (Status 0x80)
Feb 20 01:07:38 psychosis kernel: [11624.283314] ata3: hardreset failed, retrying in 5 secs
Feb 20 01:07:43 psychosis kernel: [11629.277637] ata3: hard resetting port
Feb 20 01:07:43 psychosis kernel: [11629.597039] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 20 01:07:43 psychosis kernel: [11629.598746] ata3.00: configured for UDMA/133
Feb 20 01:07:43 psychosis kernel: [11629.598898] ata3: EH complete
...

This kept happening over and over, reducing the speed each time:

Feb 20 01:18:59 psychosis kernel: [12304.433025] ata3.00: limiting speed to UDMA/66
Feb 20 01:20:42 psychosis kernel: [12406.788586] ata3.00: limiting speed to UDMA/44
Feb 20 01:22:06 psychosis kernel: [12490.909166] ata3.00: limiting speed to UDMA/33
Feb 20 01:23:24 psychosis kernel: [12568.798521] ata3.00: limiting speed to UDMA/25
Feb 20 01:24:36 psychosis kernel: [12640.161835] ata3.00: limiting speed to UDMA/16
Feb 20 01:26:05 psychosis kernel: [12729.369247] ata3.00: limiting speed to PIO4
Feb 20 01:27:17 psychosis kernel: [12800.722750] ata3.00: limiting speed to PIO3
Feb 20 01:28:28 psychosis kernel: [12871.570376] ata3.00: limiting speed to PIO2
Feb 20 01:29:39 psychosis kernel: [12942.417995] ata3.00: limiting speed to PIO1
Feb 20 01:58:27 psychosis kernel: [13013.265629] ata3.00: limiting speed to PIO0

At this point I remember seeing on the console an error about how
speed-down was requested, but no lower speed modes were available; not
sure why that didn't get into the log.  But the messages kept coming:

Feb 20 01:58:27 psychosis kernel: [13650.934246]          res 40/00:20:10:ae:40/00:00:0d:00:00/40 Emask 0x4 (timeout)
Feb 20 01:58:27 psychosis kernel: [13651.245577] ata3: soft resetting port
Feb 20 01:58:27 psychosis kernel: [13651.245611] ata3: softreset failed, retrying in 5 secs
Feb 20 01:58:27 psychosis kernel: [13656.239963] ata3: hard resetting port
Feb 20 01:58:27 psychosis kernel: [13663.549923] ata3: port is slow to respond, please be patient (Status 0x80)
Feb 20 01:58:27 psychosis kernel: [13686.521729] ata3: hardreset failed, retrying in 5 secs
Feb 20 01:58:27 psychosis kernel: [13691.516061] ata3: hard resetting port
Feb 20 01:58:27 psychosis kernel: [13691.835461] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 20 01:58:27 psychosis kernel: [13691.837090] ata3.00: configured for PIO0
Feb 20 01:58:27 psychosis kernel: [13691.837106] ata3: EH complete
Feb 20 01:58:27 psychosis kernel: [13721.785875]          res 40/00:20:10:ae:40/00:00:0d:00:00/40 Emask 0x4 (timeout)
Feb 20 01:58:27 psychosis kernel: [13722.097199] ata3: soft resetting port
Feb 20 01:58:27 psychosis kernel: [13722.097236] ata3: softreset failed, retrying in 5 secs
Feb 20 01:58:27 psychosis kernel: [13727.091584] ata3: hard resetting port
Feb 20 01:58:27 psychosis kernel: [13734.401547] ata3: port is slow to respond, please be patient (Status 0x80)
Feb 20 01:58:27 psychosis kernel: [13757.373357] ata3: hardreset failed, retrying in 5 secs
Feb 20 01:58:27 psychosis kernel: [13762.367683] ata3: hard resetting port
Feb 20 01:58:27 psychosis kernel: [13762.687085] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 20 01:58:27 psychosis kernel: [13762.689015] ata3.00: configured for PIO0
Feb 20 01:58:27 psychosis kernel: [13762.689027] ata3: EH complete

etc etc.  Since this was one of the two disks in the RAID1 root
filesystem, all processes just froze while it tried over and over to
access the disk and went through the error handling sequence.  I guess
since the EH thought it completed successfully, it was never reported
to the MD layer as being a bad disk.

At this point I realized things were stuck.  I plugged a keyboard in,
then removed all 6 disks and replugged them (not knowing which one was
ata3).  The last one I tried was finally ata3, and I left that one out
so it would finally fail the disk:

...
Feb 20 01:58:27 psychosis kernel: [14665.593082] ata3: SATA link down (SStatus 0 SControl 300)
Feb 20 01:58:27 psychosis kernel: [14665.593089] ata3.00: disabled
Feb 20 01:58:27 psychosis kernel: [14666.096260] sd 2:0:0:0: SCSI error: return code = 0x08000002
Feb 20 01:58:27 psychosis kernel: [14666.096264] sda: Current [descriptor]: sense key: Aborted Command
Feb 20 01:58:27 psychosis kernel: [14666.096267]     Additional sense: No additional sense information
Feb 20 01:58:27 psychosis kernel: [14666.096270] Descriptor sense data with sense descriptors (in hex):
Feb 20 01:58:27 psychosis kernel: [14666.096272]         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
Feb 20 01:58:27 psychosis kernel: [14666.096278]         00 00 00 00 
Feb 20 01:58:27 psychosis kernel: [14666.096282] end_request: I/O error, dev sda, sector 222342032
Feb 20 01:58:27 psychosis kernel: [14666.096731] sd 2:0:0:0: SCSI error: return code = 0x00010000
Feb 20 01:58:27 psychosis kernel: [14666.096734] end_request: I/O error, dev sda, sector 222342800
Feb 20 01:58:27 psychosis kernel: [14666.097039] md: super_written gets error=-5, uptodate=0
Feb 20 01:58:27 psychosis kernel: [14666.097044] ^IOperation continuing on 1 devices
..

Now that md failed the disk, the system came back to life and was
usable again.  I plugged ata3 back in, and it came up fine:

Feb 20 02:02:04 psychosis kernel: [14883.936792] ata3: waiting for device to spin up (8 secs)
Feb 20 02:02:12 psychosis kernel: [14892.125029] ata3: soft resetting port
Feb 20 02:02:12 psychosis kernel: [14892.125131] ata3: softreset failed, retrying in 5 secs
Feb 20 02:02:17 psychosis kernel: [14897.119415] ata3: hard resetting port
Feb 20 02:02:25 psychosis kernel: [14904.832598] ata3: port is slow to respond, please be patient (Status 0x80)
Feb 20 02:02:48 psychosis kernel: [14927.804508] ata3: hardreset failed, retrying in 5 secs
Feb 20 02:02:53 psychosis kernel: [14932.798737] ata3: hard resetting port
Feb 20 02:02:54 psychosis kernel: [14933.521359] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 20 02:02:54 psychosis kernel: [14933.522135] ata3.00: ATA-7, max UDMA/133, 625142448 sectors: LBA48 NCQ (depth 31/32)
Feb 20 02:02:54 psychosis kernel: [14933.523485] ata3.00: configured for UDMA/133
Feb 20 02:02:54 psychosis kernel: [14933.523494] ata3: EH complete
Feb 20 02:02:54 psychosis kernel: [14933.523577] scsi 2:0:0:0: Direct-Access     ATA      ST3320620AS      3.AA PQ: 0 ANSI: 5
Feb 20 02:02:54 psychosis kernel: [14933.523661] SCSI device sdh: 625142448 512-byte hdwr sectors (320073 MB)
Feb 20 02:02:54 psychosis kernel: [14933.523672] sdh: Write Protect is off
Feb 20 02:02:54 psychosis kernel: [14933.523687] SCSI device sdh: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 20 02:02:54 psychosis kernel: [14933.523741] SCSI device sdh: 625142448 512-byte hdwr sectors (320073 MB)
Feb 20 02:02:54 psychosis kernel: [14933.523750] sdh: Write Protect is off
Feb 20 02:02:54 psychosis kernel: [14933.523764] SCSI device sdh: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 20 02:02:54 psychosis kernel: [14933.523768]  sdh: sdh1 sdh2
Feb 20 02:02:54 psychosis kernel: [14933.539428] sd 2:0:0:0: Attached scsi disk sdh

Because the error at [11588.19xx] was repeated 30 times, I suspected
NCQ.  I set the queue_depth on all 6 disks down to 1, and haven't seen
the same problem since, although once I did get:

Feb 20 22:01:01 psychosis kernel: [86680.771935]          res 50/00:00:57:58:70/00:00:00:00:00/e2 Emask 0x10 (ATA bus error)
Feb 20 22:01:01 psychosis kernel: [86681.083496] ata5: soft resetting port
Feb 20 22:01:01 psychosis kernel: [86681.083540] ata5: softreset failed, retrying in 5 secs
Feb 20 22:01:06 psychosis kernel: [86686.077882] ata5: hard resetting port
Feb 20 22:01:13 psychosis kernel: [86693.387842] ata5: port is slow to respond, please be patient (Status 0x80)
Feb 20 22:01:36 psychosis kernel: [86716.359653] ata5: hardreset failed, retrying in 5 secs
Feb 20 22:01:41 psychosis kernel: [86721.353980] ata5: hard resetting port
Feb 20 22:01:41 psychosis kernel: [86721.673381] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 20 22:01:41 psychosis kernel: [86721.675047] ata5.00: configured for UDMA/133
Feb 20 22:01:41 psychosis kernel: [86721.675057] ata5: EH complete
Feb 20 22:01:42 psychosis kernel: [86721.687806] SCSI device sdc: 625142448 512-byte hdwr sectors (320073 MB)
Feb 20 22:01:42 psychosis kernel: [86721.690191] sdc: Write Protect is off
Feb 20 22:01:42 psychosis kernel: [86721.694914] SCSI device sdc: write cache: enabled, read cache: enabled, doesn't support DPO or FUA

I know it's been mentioned before that some disks have broken NCQ, but
it doesn't look like there's much in the kernel NCQ blacklist at the
moment, and the strange part is that I've never seen this problem with
the same disks and any of the other controllers I've tried that
supported NCQ (sata_nv, sata_sil24).  So I guess my concern here is
that this is actually related to AHCI in some way?

BTW - I just realized that dmesg has some error messages that are
missing from /var/log/messages, like the actual "exception Emask.."
lines.  Not sure what's up with that, but I'll give dmesg output too.

Full logs at
  http://jim.sh/~jim/ide/
Note that there's an identical disk that I sometimes had plugged into
the sata_sil24 controller; it wasn't being used at the time.

-jim
-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html