Re: NCQ change on 2007-12-07

Robert Hancock <hancockr@xxxxxxx> · Wed, 12 Mar 2008 21:07:04 -0600

Lanstein, Alex C wrote:
Hello,

I was asked to CC this list regarding an NCQ issue I'm having with many WD4000YR-01PLB0 drives.

I have ~400 WD4000YR-01PLB0 drives, and some have NCQ errors that temporarily lock the drive (and the system) until the operation times out.  They look exactly the same in hdparm (same model, firmware version, etc.), but there is a sticker on the back of the drive that we believe carries a WDC internal version number.  100% of the drives that have AD on the end of the sticker have the problem, and 100% of the ones that have AJ (or later) do not.  They all came from different batches and manufacturing locations, so we're assuming it was some sort of silent revision.  This does not appear to be the same problem as the spurious NCQ issue that Tejun fixed in December.

The sticker for a few drives that have the errors:
2061-701310-C00 AD	XS 2R19 2Q3v 1 0006340 6346
2061-701310-C00 AD	XS 2R19 2H28 4 0005330 6335
2061-701310-C00 AD	XS 2R19 1YSS 1 0002330 6332
2061-701310-C00 AD	XW 2R19 5JMQ 1 0005290 6295

The sticker for those that work correctly:
2061-701310-C00 AJ	XS 2R26 12JP 5 0007510 6517
2061-701310-C00 AJ	XS 2R26 0JPY 1 000145R 6451
2061-701310-C00 AJ	XS 2R26 12HQ 5 0007510 6517
2061-701310-C00 AK	XS 2R27 0FM6 4 0001140 7141

This sticker is on the underside of the drive, on the plastic, on the side with the legacy power connector.

This is the output from hdparm -I, as well as uname -a.  All the drives look exactly the same in hdparm -I, so I'm not going to post more than one.  

[admin@hostname db]# hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
        Model Number:       WDC WD4000YR-01PLB0
        Serial Number:      WD-WMAMY1509516
        Firmware Revision:  01.06A01
Standards:
        Supported: 7 6 5 4
        Likely used: 7
Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:   16514064
        LBA    user addressable sectors:  268435455
        LBA48  user addressable sectors:  781422768
        device size with M = 1024*1024:      381554 MBytes
        device size with M = 1000*1000:      400088 MBytes (400 GB)
Capabilities:
        LBA, IORDY(can be disabled)
        Queue depth: 32
        Standby timer values: spec'd by Standard, with device specific minimum
        R/W multiple sector transfer: Max = 16  Current = 16
        Recommended acoustic management value: 128, current value: 254
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
        Enabled Supported:
           *    NOP cmd
           *    READ BUFFER cmd
           *    WRITE BUFFER cmd
           *    Host Protected Area feature set
           *    Look-ahead
           *    Write cache
           *    Power Management feature set
                Security Mode feature set
           *    SMART feature set
           *    FLUSH CACHE EXT command
           *    Mandatory FLUSH CACHE command
           *    Device Configuration Overlay feature set
           *    48-bit Address feature set
                Automatic Acoustic Management feature set
                SET MAX security extension
           *    SET FEATURES subcommand required to spinup after power up
                Power-Up In Standby feature set
           *    DOWNLOAD MICROCODE cmd
           *    General Purpose Logging feature set
           *    SMART self-test
           *    SMART error logging
Security:
                supported
        not     enabled
        not     locked
        not     frozen
        not     expired: security count
        not     supported: enhanced erase
Checksum: correct
[admin@host db]# uname -a
Linux Phi 2.6.22.6 #1 SMP PREEMPT Thu Feb 7 11:58:42 PST 2008 x86_64 x86_64 x86_64 GNU/Linux

This is a sample of the NCQ issues in dmesg and /var/log/messages.  It appears to be a blocking call, so the system sometimes locks up for up to an hour.  To get the errors to occur, I need to be doing heavy drive writing for more than a few minutes.

[admin@loki-idp-01 ~]# cat /var/log/messages | grep kernel
Mar 11 06:11:08 loki-idp-01 kernel: ata1: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1501000 status 0x400 next cpb count 0x0 next cpb idx 0x0
Mar 11 06:11:08 loki-idp-01 kernel: ata1: CPB 1: ctl_flags 0x1f, resp_flags 0x2
Mar 11 06:11:08 loki-idp-01 kernel: ata1: CPB 2: ctl_flags 0x1f, resp_flags 0x2
Mar 11 06:11:08 loki-idp-01 kernel: ata1: CPB 3: ctl_flags 0x1f, resp_flags 0x2
Mar 11 06:11:08 loki-idp-01 kernel: ata1: timeout waiting for ADMA IDLE, stat=0x400
Mar 11 06:11:08 loki-idp-01 kernel: ata1.00: exception Emask 0x0 SAct 0xe SErr 0x0 action 0x2 frozen
Mar 11 06:11:08 loki-idp-01 kernel: ata1.00: cmd 60/08:08:02:74:51/00:00:00:00:00/40 tag 1 cdb 0x0 data 4096 in
Mar 11 06:11:08 loki-idp-01 kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 11 06:11:08 loki-idp-01 kernel: ata1.00: cmd 60/10:10:3a:74:51/00:00:00:00:00/40 tag 2 cdb 0x0 data 8192 in
Mar 11 06:11:08 loki-idp-01 kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 11 06:11:08 loki-idp-01 kernel: ata1.00: cmd 60/08:18:52:74:51/00:00:00:00:00/40 tag 3 cdb 0x0 data 4096 in
Mar 11 06:11:08 loki-idp-01 kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 11 06:11:08 loki-idp-01 kernel: sd 0:0:0:0: [sda] 781422768 512-byte hardware sectors (400088 MB)
Mar 11 06:11:08 loki-idp-01 kernel: sd 0:0:0:0: [sda] Write Protect is off
Mar 11 06:11:08 loki-idp-01 kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar 11 06:19:21 loki-idp-01 kernel: ata1: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1501000 status 0x400 next cpb count 0x0 next cpb idx 0x0
Mar 11 06:19:21 loki-idp-01 kernel: ata1: CPB 2: ctl_flags 0x1f, resp_flags 0x2
Mar 11 06:19:21 loki-idp-01 kernel: ata1: CPB 3: ctl_flags 0x1f, resp_flags 0x2
Mar 11 06:19:21 loki-idp-01 kernel: ata1: CPB 4: ctl_flags 0x1f, resp_flags 0x2
Mar 11 06:19:21 loki-idp-01 kernel: ata1: CPB 5: ctl_flags 0x1f, resp_flags 0x2
Mar 11 06:19:21 loki-idp-01 kernel: ata1: CPB 6: ctl_flags 0x1f, resp_flags 0x2
Mar 11 06:19:21 loki-idp-01 kernel: ata1: CPB 7: ctl_flags 0x1f, resp_flags 0x2
Mar 11 06:19:21 loki-idp-01 kernel: ata1: timeout waiting for ADMA IDLE, stat=0x400
Mar 11 06:19:21 loki-idp-01 kernel: ata1.00: exception Emask 0x0 SAct 0xfc SErr 0x0 action 0x2 frozen
Mar 11 06:19:21 loki-idp-01 kernel: ata1.00: cmd 60/08:10:4b:d0:6c/00:00:2a:00:00/40 tag 2 cdb 0x0 data 4096 in
Mar 11 06:19:21 loki-idp-01 kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 11 06:19:21 loki-idp-01 kernel: ata1.00: cmd 60/08:18:5b:d0:6c/00:00:2a:00:00/40 tag 3 cdb 0x0 data 4096 in
Mar 11 06:19:21 loki-idp-01 kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 11 06:19:21 loki-idp-01 kernel: ata1.00: cmd 60/08:20:13:d1:6c/00:00:2a:00:00/40 tag 4 cdb 0x0 data 4096 in
Mar 11 06:19:21 loki-idp-01 kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 11 06:19:21 loki-idp-01 kernel: ata1.00: cmd 60/08:28:1b:d1:6c/00:00:2a:00:00/40 tag 5 cdb 0x0 data 4096 in
Mar 11 06:19:21 loki-idp-01 kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 11 06:19:21 loki-idp-01 kernel: ata1.00: cmd 60/08:30:2b:d1:6c/00:00:2a:00:00/40 tag 6 cdb 0x0 data 4096 in
Mar 11 06:19:21 loki-idp-01 kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 11 06:19:21 loki-idp-01 kernel: ata1.00: cmd 60/10:38:43:9f:6d/00:00:2a:00:00/40 tag 7 cdb 0x0 data 8192 in
Mar 11 06:19:21 loki-idp-01 kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 11 06:19:21 loki-idp-01 kernel: sd 0:0:0:0: [sda] 781422768 512-byte hardware sectors (400088 MB)
Mar 11 06:19:21 loki-idp-01 kernel: sd 0:0:0:0: [sda] Write Protect is off
Mar 11 06:19:21 loki-idp-01 kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar 11 07:23:13 loki-idp-01 kernel: ip_tables: (C) 2000-2006 Netfilter Core Team
Mar 11 07:28:33 loki-idp-01 kernel: ip_tables: (C) 2000-2006 Netfilter Core Team

I don't think the driver is to blame for this one. CPB resp_flags of 2 
indicates the controller says that tag is "released", i.e. the drive has 
been given the command but has not completed it. If it's timing out in 
that state, either the NCQ implementation is bad enough that it allows 
certain queued commands to starve for over 30 seconds, or the drive is 
just forgetting about them entirely.
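
As a cross-check on the logs above, the SAct mask and the NCQ taskfile bytes can be decoded by hand or with a small script. The following is only a sketch in Python. It assumes the libata error-handler line layout seen in the quoted log (cmd CMD/FEATURE:NSECT:LBAL:LBAM:LBAH/HOB.../DEVICE tag N), and that for opcode 0x60 (READ FPDMA QUEUED) the sector count travels in the feature bytes and the tag in bits 7:3 of nsect. The function names are made up for illustration.

```python
import re

# Layout assumed from the quoted log lines:
#   cmd CMD/FEATURE:NSECT:LBAL:LBAM:LBAH/HOB_FEATURE:.../DEVICE tag N
CMD_RE = re.compile(
    r"cmd (?P<cmd>[0-9a-f]{2})/(?P<feature>[0-9a-f]{2}):(?P<nsect>[0-9a-f]{2}):"
    r"[0-9a-f]{2}:[0-9a-f]{2}:[0-9a-f]{2}/(?P<hob_feature>[0-9a-f]{2}):"
    r".*? tag (?P<tag>\d+)"
)


def sact_tags(mask):
    """Return the NCQ tags whose bits are set in an SAct mask."""
    return [bit for bit in range(32) if (mask >> bit) & 1]


def decode_ncq_cmd(line):
    """Decode opcode, tag and transfer size from one 'cmd ...' log line.

    Returns None if the line does not match the assumed layout.
    A sector count of 0 (special-cased by the ATA spec) is not handled here.
    """
    m = CMD_RE.search(line)
    if m is None:
        return None
    sectors = (int(m["hob_feature"], 16) << 8) | int(m["feature"], 16)
    return {
        "opcode": int(m["cmd"], 16),
        "tag_from_nsect": int(m["nsect"], 16) >> 3,  # tag lives in bits 7:3
        "tag_reported": int(m["tag"]),
        "bytes": sectors * 512,
    }


if __name__ == "__main__":
    print(sact_tags(0xE))   # first timeout in the log
    print(sact_tags(0xFC))  # second timeout in the log
    print(decode_ncq_cmd(
        "ata1.00: cmd 60/10:10:3a:74:51/00:00:00:00:00/40 tag 2 "
        "cdb 0x0 data 8192 in"
    ))
```

For the two timeouts quoted above, SAct 0xe decodes to tags 1-3 and SAct 0xfc to tags 2-7, which lines up with the CPB entries the controller reports with resp_flags 0x2.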

Unfortunately, if the firmware revision does indeed appear the same on 
both sets of drives, the only automated way we could likely address the 
problem would be to blacklist NCQ on all drives of that 
model/revision.
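
Until something like that exists in the kernel, a userspace workaround along the same lines would be to force the queue depth to 1 on every matching drive, which effectively turns NCQ off for it. The sketch below, in Python, assumes the usual sysfs layout (/sys/block/sdX/device/model and /sys/block/sdX/device/queue_depth). The "WDC WD4000YR" prefix and the function names are illustrative choices, and the sysfs model string may be truncated relative to what hdparm -I prints. Like a kernel blacklist, this cannot tell the AD drives from the AJ ones, so it would penalize the good drives too.

```python
from pathlib import Path


def matching_queue_depth_files(model_prefix, sysfs_block="/sys/block"):
    """Find queue_depth files for sd* disks whose model starts with model_prefix."""
    hits = []
    for dev in sorted(Path(sysfs_block).glob("sd*")):
        model_file = dev / "device" / "model"
        depth_file = dev / "device" / "queue_depth"
        if not (model_file.is_file() and depth_file.is_file()):
            continue
        if model_file.read_text().strip().startswith(model_prefix):
            hits.append(depth_file)
    return hits


def disable_ncq(depth_files):
    """Write a queue depth of 1 to each file (needs root on a real system)."""
    for depth_file in depth_files:
        depth_file.write_text("1\n")


if __name__ == "__main__":
    # Hypothetical prefix; adjust to whatever the model file actually reports.
    targets = matching_queue_depth_files("WDC WD4000YR")
    for t in targets:
        print("would set queue depth to 1:", t)
    # disable_ncq(targets)  # uncomment to apply
```

The setting does not persist across reboots, so it would have to be reapplied from an init script or similar.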