SATA drive reset/disable events on ICH7 ata_piix when polling SMART info

Hi,

I have a couple of Debian Lenny ("2.6.26-2-amd64") boxes on rented hardware, each with a couple of SATA drives:

One has 2x 1TB Seagate Barracuda 7200.11 model ST31000333AS firmware SD35

The other has 2x 2TB WD Caviar Green model WDC WD20EADS-00R6B0 firmware 01.00A01

... the machines are currently set up to run smartd, and also log HDD temp via munin. ata_piix is the driver in use.

The WD machine did this sort of thing a couple of times, which got my attention.

[119061.717865] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[119061.717865] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[119061.717865] ata1.00: status: { DRDY }
[119071.117368] ata1: link is slow to respond, please be patient (ready=0)
[119079.800059] ata1: device not ready (errno=-16), forcing hardreset
[119079.800091] ata1: soft resetting link
[119087.950128] ata1: link is slow to respond, please be patient (ready=0)
[119097.895803] ata1: SRST failed (errno=-16)
[119097.895881] ata1: soft resetting link
[119107.170874] ata1: link is slow to respond, please be patient (ready=0)
[119114.902193] ata1: SRST failed (errno=-16)
[119114.902219] ata1: soft resetting link
[119123.749111] ata1: link is slow to respond, please be patient (ready=0)
[119176.735727] ata1: SRST failed (errno=-16)
[119176.735761] ata1: soft resetting link
[119185.513569] ata1: SRST failed (errno=-16)
[119185.513593] ata1: reset failed, giving up
[119185.513622] ata1.00: disabled
[119185.513643] ata1.01: disabled
[119185.513680] end_request: I/O error, dev sda, sector 39069887
[119185.516684] ata1: EH complete
[119186.013456] sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
[119186.013456] end_request: I/O error, dev sda, sector 36525807


If I run a continuous "dd of=file ; sync ; rm file ; sync" to a file on the RAID1 mirror of both drives, while also running a continuous "smartctl -s on -a /dev/sdX > /dev/null || echo failed", then:

1. The smartctl command fails about once in 20 times, and I get a lot of this happening:

[93058.989603] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[93058.989645] ata1.01: cmd 35/00:00:a4:f2:51/00:04:03:00:00/f0 tag 0 dma 524288 out
[93058.993582] ata1.01: status: { DRDY }
[93058.993582] ata1: soft resetting link

[93090.804353] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[93090.804395] ata1.01: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/10 tag 0 pio 512 in
[93090.804427] ata1.01: status: { DRDY }
[93090.804458] ata1: soft resetting link

[93252.493902] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[93252.493913] ata1.01: cmd c8/00:80:4c:d0:83/00:00:00:00:00/fa tag 0 dma 65536 in
[93252.493913] ata1.01: status: { DRDY }
[93252.493913] ata1: soft resetting link

[96265.917847] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[96265.917889] ata1.01: cmd c8/00:80:4c:2c:c1/00:00:00:00:00/fa tag 0 dma 65536 in
[96265.921800] ata1.01: status: { DRDY }
[96265.921800] ata1: soft resetting link

[96405.491834] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[96405.491834] ata1.01: cmd 25/00:00:cc:a6:c3/00:04:0a:00:00/f0 tag 0 dma 524288 in
[96405.491834] ata1.01: status: { DRDY }
[96413.900149] ata1: link is slow to respond, please be patient (ready=0)

[99772.901861] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[99772.901861] ata1.01: cmd ca/00:08:cc:d3:54/00:00:00:00:00/f3 tag 0 dma 4096 out
[99772.901861] ata1.01: status: { DRDY }
[99783.604235] ata1: link is slow to respond, please be patient (ready=0)

[100012.860158] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[100012.860201] ata1.01: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/10 tag 0 pio 512 in
[100012.860247] ata1.01: status: { DRDY }
[100012.860281] ata1: soft resetting link

[100256.314912] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[100256.314950] ata1.01: cmd c8/00:80:cc:12:13/00:00:00:00:00/fb tag 0 dma 65536 in
[100256.314997] ata1.01: status: { DRDY }
[100256.315025] ata1: soft resetting link

[101528.503318] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[101528.503318] ata1.01: cmd c8/00:00:4c:c4:2c/00:00:00:00:00/fb tag 0 dma 131072 in
[101528.503318] ata1.01: status: { DRDY }
[101535.883662] ata1: link is slow to respond, please be patient (ready=0)

[107747.382563] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[107747.382605] ata1.01: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/10 tag 0 pio 512 in
[107747.386545] ata1.01: status: { DRDY }
[107747.386545] ata1: soft resetting link

[107918.831736] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[107918.831736] ata1.01: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/10 tag 0 pio 512 in
[107918.831736] ata1.01: status: { DRDY }
[107918.831736] ata1: soft resetting link
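
To see which commands are timing out, the "cmd" lines in the logs above can be decoded. This is only a rough sketch: the opcode and SMART feature names come from the ATA command set, the tables cover just the values seen in this thread, and the function name decode_cmd_lines is made up for illustration. It assumes the libata layout of command byte, then "/", then the feature byte.

```python
import re

# ATA opcodes seen in the logs in this post (not a complete table).
ATA_OPCODES = {
    0x25: "READ DMA EXT",
    0x35: "WRITE DMA EXT",
    0xB0: "SMART",
    0xC8: "READ DMA",
    0xCA: "WRITE DMA",
    0xEA: "FLUSH CACHE EXT",
}

# For opcode 0xB0 the feature register selects the SMART subcommand.
SMART_FEATURES = {
    0xD0: "READ DATA",
    0xD5: "READ LOG",
    0xD8: "ENABLE OPERATIONS",
}

# Matches "[ts] ataX.YY: cmd OP/FEAT:" anywhere in the text, so it also
# copes with several log entries pasted onto one line.
CMD_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+(ata\d+\.\d+): cmd ([0-9a-f]{2})/([0-9a-f]{2}):"
)


def decode_cmd_lines(log_text):
    """Return a list of (timestamp, device, command name) tuples."""
    decoded = []
    for ts, dev, op_hex, feat_hex in CMD_RE.findall(log_text):
        op = int(op_hex, 16)
        feat = int(feat_hex, 16)
        name = ATA_OPCODES.get(op, "unknown 0x%02x" % op)
        if op == 0xB0:
            name += " " + SMART_FEATURES.get(feat, "feature 0x%02x" % feat)
        decoded.append((float(ts), dev, name))
    return decoded
```

Run over the WD logs, this shows the timeouts landing on ordinary reads and writes as well as on SMART READ LOG (b0/d5), so the command that times out is not always the SMART one itself.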


Sometimes the "resetting link" happens several times in a row, and if it happens enough times, ata_piix gives up and disables BOTH drives (as in the first log above), which is a bit annoying. The case where the reset fails outright normally seems to happen when the drives are not doing much (i.e. in normal operation rather than under test).

If I disable SMART data collection (smartd and munin), the errors seem to stop. I can obviously do that, but would prefer not to.

smartctl -x reports the following interesting-looking stuff on the device which I've been stressing with smartctl:

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
...
0x000a  2            5  Device-to-host register FISes sent due to a COMRESET
0x8000  4        79322  Vendor specific

and this on the one where I haven't:

0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
0x8000  4         6779  Vendor specific



... so I would suspect that this is a bug in the WD drives, except that the same thing seems to occasionally happen on the machine with the Seagate drives:

[1718254.879156] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[1718254.879211] ata1.00: cmd c8/00:08:3c:f1:bf/00:00:00:00:00/e9 tag 0 dma 4096 in
[1718254.879213] res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[1718254.879316] ata1.00: status: { DRDY }
[1718262.237404] ata1: link is slow to respond, please be patient (ready=0)
[1718270.057698] ata1: device not ready (errno=-16), forcing hardreset
[1718270.057732] ata1: soft resetting link
[1718277.841779] ata1: link is slow to respond, please be patient (ready=0)
[1718281.134473] ata1.00: configured for UDMA/133
[1718281.192815] ata1.01: configured for UDMA/133
[1718281.192815] ata1: EH complete

[1729049.865692] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[1729049.865692] ata1.00: cmd c8/00:08:dc:b3:bf/00:00:00:00:00/e9 tag 0 dma 4096 in
[1729049.865692] res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[1729049.865692] ata1.00: status: { DRDY }
[1729059.627313] ata1: link is slow to respond, please be patient (ready=0)
[1729068.499782] ata1: device not ready (errno=-16), forcing hardreset
[1729068.499823] ata1: soft resetting link
[1729078.434813] ata1: link is slow to respond, please be patient (ready=0)
[1729088.807850] ata1: SRST failed (errno=-16)
[1729088.807881] ata1: soft resetting link
[1729089.582856] ata1.00: configured for UDMA/133


with this on the stressed drive:

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2           10  Device-to-host register FISes sent due to a COMRESET


and this on the non-stressed drive:

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2            1  Device-to-host register FISes sent due to a COMRESET
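
For comparing the COMRESET counter between drives without eyeballing it, something like the following could pull the Phy Event Counter rows out of saved "smartctl -x" output. It is a sketch that assumes the four-column layout shown in this post (ID, Size, Value, Description); the function name phy_counters is hypothetical, and other smartctl versions may format the table differently.

```python
import re

# One counter row: 4-digit hex ID, size in bytes, value, free-text description.
ROW_RE = re.compile(
    r"^[ \t]*(0x[0-9a-fA-F]{4})[ \t]+(\d+)[ \t]+(\d+)[ \t]+(.*\S)[ \t]*$",
    re.MULTILINE,
)


def phy_counters(smartctl_output):
    """Map counter ID -> (value, description) for every matching row."""
    counters = {}
    for cid, _size, value, desc in ROW_RE.findall(smartctl_output):
        counters[int(cid, 16)] = (int(value), desc)
    return counters
```

Counter 0x000a is the one of interest here; calling phy_counters() on the output saved from each drive and comparing the values for key 0x000a gives the stressed versus non-stressed difference directly.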


I'd be happy to put a newer kernel on one or both machines to see if that'd have any effect. I also tried doing "hdparm -I" instead of "smartctl -a" for a few hours but that didn't elicit any "frozen" messages (although I should probably run it for a bit longer to have more confidence in that statement).
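
To put numbers on the smartctl versus "hdparm -I" comparison, the polling half of the test could be driven by a small script that counts non-zero exits. This is a minimal sketch rather than what I actually ran: the helper names poll and compare are made up, and it treats any non-zero exit status as a failure, the same as the "|| echo failed" in the shell loop.

```python
import subprocess
import time


def poll(cmd, iterations, delay=0.0):
    """Run cmd repeatedly with output discarded; return the failure count."""
    failures = 0
    for _ in range(iterations):
        rc = subprocess.call(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        # Note: smartctl's exit status is a bitmask, and some bits report
        # disk status rather than a failed command, so this may overcount.
        if rc != 0:
            failures += 1
        time.sleep(delay)
    return failures


def compare(dev, iterations=1000):
    """Poll dev with both tools and return {label: failure count}."""
    commands = {
        "smartctl": ["smartctl", "-s", "on", "-a", dev],
        "hdparm": ["hdparm", "-I", dev],
    }
    return {label: poll(cmd, iterations) for label, cmd in commands.items()}
```

Calling compare("/dev/sda") as root while the dd loop runs in another shell would give a failure count per tool, which can then be lined up against the "frozen" messages in dmesg.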

So, err I suppose that this could be a bug in:

. smartctl
. both HD firmwares
. ata_piix (certainly disabling both drives seems a bit drastic, but I don't know if this is a function of the hardware)
. the ICH7 hardware

Unfortunately, as I don't own the hardware, I'm not in a position to get a different SATA controller into the boxes to eliminate the last two.

Any ideas welcome....

Cheers,

Tim.