Lanstein, Alex C wrote:
Hello,
I was asked to CC this list regarding an NCQ issue I'm having with many WD400YR-01PLB0 drives.
I have ~400 WD400YR-01PLB0 drives and some have NCQ errors that temporarily lock drive (and system) until the operation times out. They look exactly the same in hdparm (same model, firmware version, etc), but there is a sticker on the back of the drive that we believe is a WDC internal version number. 100% of the drives that have AD on the end of the sticker have the problem and 100% of the ones that have AJ (or later) do not. They all came from different batches and manufacturing locations so we're assuming it was some sort of silent revision. This does not appear to be the same problem as the spurious NCQ issue thta Tejun fixed in December.
The sticker for a few drives that have the errors:
2061-701310-C00 AD XS 2R19 2Q3v 1 0006340 6346
2061-701310-C00 AD XS 2R19 2H28 4 0005330 6335
2061-701310-C00 AD XS 2R19 1YSS 1 0002330 6332
2061-701310-C00 AD XW 2R19 5JMQ 1 0005290 6295
The sticker for those that work correctly:
2061-701310-C00 AJ XS 2R26 12JP 5 0007510 6517
2061-701310-C00 AJ XS 2R26 0JPY 1 000145R 6451
2061-701310-C00 AJ XS 2R26 12HQ 5 0007510 6517
2061-701310-C00 AK XS 2R27 0FM6 4 0001140 7141
This sticker is on the underside of the drive on the plastic on the side of the drive with the legacy power connector
This is the output from hdparm -I, as well as uname -a. All the drives look exactly the same in hdparm -I, so I'm not going to post more than one.
[admin@hostname db]# hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
Model Number: WDC WD4000YR-01PLB0
Serial Number: WD-WMAMY1509516
Firmware Revision: 01.06A01
Standards:
Supported: 7 6 5 4
Likely used: 7
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 781422768
device size with M = 1024*1024: 381554 MBytes
device size with M = 1000*1000: 400088 MBytes (400 GB)
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, with device specific minimum
R/W multiple sector transfer: Max = 16 Current = 16
Recommended acoustic management value: 128, current value: 254
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* NOP cmd
* READ BUFFER cmd
* WRITE BUFFER cmd
* Host Protected Area feature set
* Look-ahead
* Write cache
* Power Management feature set
Security Mode feature set
* SMART feature set
* FLUSH CACHE EXT command
* Mandatory FLUSH CACHE command
* Device Configuration Overlay feature set
* 48-bit Address feature set
Automatic Acoustic Management feature set
SET MAX security extension
* SET FEATURES subcommand required to spinup after power up
Power-Up In Standby feature set
* DOWNLOAD MICROCODE cmd
* General Purpose Logging feature set
* SMART self-test
* SMART error logging
Security:
supported
not enabled
not locked
not frozen
not expired: security count
not supported: enhanced erase
Checksum: correct
[admin@host db]# uname -a
Linux Phi 2.6.22.6 #1 SMP PREEMPT Thu Feb 7 11:58:42 PST 2008 x86_64 x86_64 x86_64 GNU/Linux
These are a sample of the NCQ issues in dmesg and /var/log/messages. It appears to be a blocking call so the system locks up for sometimes up to an hour. To get the errors to occur I need to be doing heavy drive writing for a period of more than a few minutes.
[admin@loki-idp-01 ~]# cat /var/log/messages | grep kernel
Mar 11 06:11:08 loki-idp-01 kernel: ata1: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1501000 status 0x400 next cpb count 0x0 next cpb idx 0x0
Mar 11 06:11:08 loki-idp-01 kernel: ata1: CPB 1: ctl_flags 0x1f, resp_flags 0x2
Mar 11 06:11:08 loki-idp-01 kernel: ata1: CPB 2: ctl_flags 0x1f, resp_flags 0x2
Mar 11 06:11:08 loki-idp-01 kernel: ata1: CPB 3: ctl_flags 0x1f, resp_flags 0x2
Mar 11 06:11:08 loki-idp-01 kernel: ata1: timeout waiting for ADMA IDLE, stat=0x400
Mar 11 06:11:08 loki-idp-01 kernel: ata1.00: exception Emask 0x0 SAct 0xe SErr 0x0 action 0x2 frozen
Mar 11 06:11:08 loki-idp-01 kernel: ata1.00: cmd 60/08:08:02:74:51/00:00:00:00:00/40 tag 1 cdb 0x0 data 4096 in
Mar 11 06:11:08 loki-idp-01 kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 11 06:11:08 loki-idp-01 kernel: ata1.00: cmd 60/10:10:3a:74:51/00:00:00:00:00/40 tag 2 cdb 0x0 data 8192 in
Mar 11 06:11:08 loki-idp-01 kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 11 06:11:08 loki-idp-01 kernel: ata1.00: cmd 60/08:18:52:74:51/00:00:00:00:00/40 tag 3 cdb 0x0 data 4096 in
Mar 11 06:11:08 loki-idp-01 kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 11 06:11:08 loki-idp-01 kernel: sd 0:0:0:0: [sda] 781422768 512-byte hardware sectors (400088 MB)
Mar 11 06:11:08 loki-idp-01 kernel: sd 0:0:0:0: [sda] Write Protect is off
Mar 11 06:11:08 loki-idp-01 kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar 11 06:19:21 loki-idp-01 kernel: ata1: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1501000 status 0x400 next cpb count 0x0 next cpb idx 0x0
Mar 11 06:19:21 loki-idp-01 kernel: ata1: CPB 2: ctl_flags 0x1f, resp_flags 0x2
Mar 11 06:19:21 loki-idp-01 kernel: ata1: CPB 3: ctl_flags 0x1f, resp_flags 0x2
Mar 11 06:19:21 loki-idp-01 kernel: ata1: CPB 4: ctl_flags 0x1f, resp_flags 0x2
Mar 11 06:19:21 loki-idp-01 kernel: ata1: CPB 5: ctl_flags 0x1f, resp_flags 0x2
Mar 11 06:19:21 loki-idp-01 kernel: ata1: CPB 6: ctl_flags 0x1f, resp_flags 0x2
Mar 11 06:19:21 loki-idp-01 kernel: ata1: CPB 7: ctl_flags 0x1f, resp_flags 0x2
Mar 11 06:19:21 loki-idp-01 kernel: ata1: timeout waiting for ADMA IDLE, stat=0x400
Mar 11 06:19:21 loki-idp-01 kernel: ata1.00: exception Emask 0x0 SAct 0xfc SErr 0x0 action 0x2 frozen
Mar 11 06:19:21 loki-idp-01 kernel: ata1.00: cmd 60/08:10:4b:d0:6c/00:00:2a:00:00/40 tag 2 cdb 0x0 data 4096 in
Mar 11 06:19:21 loki-idp-01 kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 11 06:19:21 loki-idp-01 kernel: ata1.00: cmd 60/08:18:5b:d0:6c/00:00:2a:00:00/40 tag 3 cdb 0x0 data 4096 in
Mar 11 06:19:21 loki-idp-01 kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 11 06:19:21 loki-idp-01 kernel: ata1.00: cmd 60/08:20:13:d1:6c/00:00:2a:00:00/40 tag 4 cdb 0x0 data 4096 in
Mar 11 06:19:21 loki-idp-01 kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 11 06:19:21 loki-idp-01 kernel: ata1.00: cmd 60/08:28:1b:d1:6c/00:00:2a:00:00/40 tag 5 cdb 0x0 data 4096 in
Mar 11 06:19:21 loki-idp-01 kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 11 06:19:21 loki-idp-01 kernel: ata1.00: cmd 60/08:30:2b:d1:6c/00:00:2a:00:00/40 tag 6 cdb 0x0 data 4096 in
Mar 11 06:19:21 loki-idp-01 kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 11 06:19:21 loki-idp-01 kernel: ata1.00: cmd 60/10:38:43:9f:6d/00:00:2a:00:00/40 tag 7 cdb 0x0 data 8192 in
Mar 11 06:19:21 loki-idp-01 kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 11 06:19:21 loki-idp-01 kernel: sd 0:0:0:0: [sda] 781422768 512-byte hardware sectors (400088 MB)
Mar 11 06:19:21 loki-idp-01 kernel: sd 0:0:0:0: [sda] Write Protect is off
Mar 11 06:19:21 loki-idp-01 kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar 11 07:23:13 loki-idp-01 kernel: ip_tables: (C) 2000-2006 Netfilter Core Team
Mar 11 07:28:33 loki-idp-01 kernel: ip_tables: (C) 2000-2006 Netfilter Core Team
I don't think the driver is to blame for this one. CPB resp_flags of 2
indicates the controller says that tag is "released", i.e. drive has
been given the command but not completed. If it's timing out in that
state, either the NCQ implementation is bad enough that it allows
certain queued commands to starve for over 30 seconds, or the drive is
just forgetting about them entirely.
Unfortunately, if the firmware revision does indeed appear the same on
both sets of drives, the only automated way we could likely address the
problem would be to blacklist NCQ on all of those drives of that
model/revision..
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html