Re: libata-scsi: ata_to_sense_error handling status 0x40

On 9/2/22 15:34, Peter Fröhlich wrote:
> On Fri, Sep 2, 2022 at 4:35 AM Damien Le Moal
> <damien.lemoal@xxxxxxxxxxxxxxxxxx> wrote:
>> Your drive seems to be an exception to my (1) statement and the error it
>> returns seems weird enough that the stat_table ends up being used.
>> Could you send a dmesg output of a failed command so that we can see the
>> err_mask etc info for the failed command ? And it would be good to add a
>> print of the drv_stat and drv_err parameters passed to
>> ata_to_sense_error() for the failures you are seeing. That would help
>> trying to figure out what your drive is attempting to signal.
> 
> I don't think the drive wants to "signal" anything, instead it simply
> "disappears" at some point. The "original" error is "Emask 0x4
> (timeout)". So here's an example from early on when I had not made
> many kernel changes yet:

Sounds like the drive FW is crashing...

> -----CUT-----
> ...
> [  516.296397] ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
> [  516.296399] ata9.00: failed command: WRITE DMA

Are you running this drive with device/queue_depth set to 1? What is
issuing a WRITE DMA instead of the NCQ equivalent? Is this a passthrough
command?
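
For readers following along, here is a minimal sketch of how the opcode in
the "cmd ca/..." line can be classified as NCQ or non-NCQ. The opcode values
are the standard ATA ones; the function names are hypothetical and this is
not libata code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Standard ATA opcodes for the common DMA read/write commands. */
enum {
    OP_READ_DMA_EXT       = 0x25,
    OP_WRITE_DMA_EXT      = 0x35,
    OP_READ_FPDMA_QUEUED  = 0x60, /* NCQ read */
    OP_WRITE_FPDMA_QUEUED = 0x61, /* NCQ write */
    OP_READ_DMA           = 0xC8,
    OP_WRITE_DMA          = 0xCA,
};

/* Hypothetical helper: true if the opcode is an NCQ read or write. */
bool is_ncq_rw(uint8_t opcode)
{
    return opcode == OP_READ_FPDMA_QUEUED || opcode == OP_WRITE_FPDMA_QUEUED;
}

/* Hypothetical helper: printable name, or NULL for opcodes not listed here. */
const char *rw_opcode_name(uint8_t opcode)
{
    switch (opcode) {
    case OP_READ_DMA_EXT:       return "READ DMA EXT";
    case OP_WRITE_DMA_EXT:      return "WRITE DMA EXT";
    case OP_READ_FPDMA_QUEUED:  return "READ FPDMA QUEUED";
    case OP_WRITE_FPDMA_QUEUED: return "WRITE FPDMA QUEUED";
    case OP_READ_DMA:           return "READ DMA";
    case OP_WRITE_DMA:          return "WRITE DMA";
    default:                    return NULL;
    }
}
```

The failed command in the log has opcode 0xca, for which is_ncq_rw() returns
false, hence the question about queue depth and passthrough.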

> [  516.296402] ata9.00: cmd ca/00:23:51:03:4b/00:00:00:00:00/ed tag 4
> dma 17920 out
>                         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask
> 0x4 (timeout)
> [  516.296403] ata9.00: status: { DRDY }
> ...
> [  516.761214] ata9: translated ATA stat/err 0x40/00 to SCSI
> SK/ASC/ASCQ 0x5/21/04
> [  516.761215] ata9.00: device reported invalid CHS sector 0

Yeah... An unaligned write error should normally also signal the LBA that
was being accessed when the error occurred. ata_tf_read_block() does not
see the LBA flag set, thinks it is CHS and ends up with garbage
information. We can ignore this one. The problem is the bogus unaligned
write error in the first place.
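
The LBA versus CHS interpretation described above can be sketched in user
space as follows. This is a simplified illustration, not the kernel's
ata_tf_read_block(): the struct, the is_lba field and the function name are
hypothetical, only the 28-bit LBA case is handled, and the CHS geometry is
passed in by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_INVALID UINT64_MAX

/* Hypothetical, reduced view of a result taskfile. */
struct tf_result {
    bool    is_lba;  /* stands in for "the LBA flag is set" */
    uint8_t lbal;
    uint8_t lbam;
    uint8_t lbah;
    uint8_t device;  /* low nibble: LBA bits 27:24, or the CHS head */
};

/* Turn a taskfile into a block number, or BLOCK_INVALID if it is garbage. */
uint64_t tf_to_block(const struct tf_result *tf, uint32_t heads, uint32_t sectors)
{
    assert(tf != NULL);

    if (tf->is_lba) {
        /* 28-bit LBA: device[3:0] : lbah : lbam : lbal */
        return ((uint64_t)(tf->device & 0x0f) << 24) |
               ((uint64_t)tf->lbah << 16) |
               ((uint64_t)tf->lbam << 8) |
               tf->lbal;
    }

    /* CHS: sector numbers start at 1, so sector 0 cannot be valid. */
    uint32_t cyl  = tf->lbam | ((uint32_t)tf->lbah << 8);
    uint32_t head = tf->device & 0x0f;
    uint32_t sect = tf->lbal;

    if (sect == 0)
        return BLOCK_INVALID;

    return ((uint64_t)cyl * heads + head) * sectors + sect - 1;
}
```

With the command taskfile from the log (lbal 0x51, lbam 0x03, lbah 0x4b,
device 0xed) the LBA branch yields 0x0d4b0351, which matches the LBA bytes in
the Write(16) CDB. With the all-zero result taskfile and no LBA flag, the CHS
branch sees sector 0 and gives up, which corresponds to the "invalid CHS
sector 0" message.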

> [  516.761220] sd 8:0:0:0: [sdk] tag#4 scsi_eh_8: flush finish cmd
> [  516.761224] sd 8:0:0:0: [sdk] tag#4 FAILED Result: hostbyte=DID_OK
> driverbyte=DRIVER_SENSE
> [  516.761226] sd 8:0:0:0: [sdk] tag#4 Sense Key : Illegal Request [current]
> [  516.761228] sd 8:0:0:0: [sdk] tag#4 Add. Sense: Unaligned write command
> [  516.761229] sd 8:0:0:0: [sdk] tag#4 CDB: Write(16) 8a 00 00 00 00
> 00 0d 4b 03 51 00 00 00 23 00 00
> ...
> -----CUT-----
> 
> That "translated" line is only output because I changed "if (verbose)"
> to "if (1)" in that kernel. Also note the bizarre "CHS" error which
> only happens on some of these, not all; I had mentioned before that I
> am trying to track down how it happens that the LBA bit suddenly
> disappears (it might have to do with the hardreset being in process at
> this point and this message racing against the new IDENTIFY?). Using

See above for the explanation. That message is bogus because the error is
bogus too.
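
To illustrate why the error is called bogus, here is a small sketch of a
first-match lookup on the status byte. The table contains only the single
row that the log above confirms (status bit 0x40 translated to SK/ASC/ASCQ
0x5/21/04); any other rows are deliberately omitted, and the names and the
"carries error information" criterion are assumptions made for this example,
not the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Standard ATA status register bits used in this sketch. */
#define ST_BSY  0x80u
#define ST_DRDY 0x40u
#define ST_DF   0x20u
#define ST_ERR  0x01u

struct sense {
    uint8_t sk, asc, ascq;
};

struct status_row {
    uint8_t      mask;
    struct sense sense;
};

/* Only the row visible in the log; a real table would have more entries. */
static const struct status_row status_rows[] = {
    { ST_DRDY, { 0x05, 0x21, 0x04 } }, /* Illegal Request, unaligned write */
};

/* First row whose mask bit is set in stat wins; false if nothing matches. */
bool status_to_sense(uint8_t stat, struct sense *out)
{
    assert(out != NULL);

    for (size_t i = 0; i < sizeof(status_rows) / sizeof(status_rows[0]); i++) {
        if (stat & status_rows[i].mask) {
            *out = status_rows[i].sense;
            return true;
        }
    }
    return false;
}

/* Assumed criterion: BSY, DF, ERR or a nonzero error register say something. */
bool carries_error_info(uint8_t stat, uint8_t err)
{
    return (stat & (ST_BSY | ST_DF | ST_ERR)) != 0 || err != 0;
}
```

For stat/err 0x40/00 as in the log, carries_error_info() is false (only DRDY
is set, which is the normal ready state), yet a lookup keyed on the DRDY bit
still produces a sense code. That is the sense in which the unaligned write
error says nothing about what the drive actually did.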

> the trace facility I can *sometimes* see the command being issued and
> then 30 seconds later the timeout happening; sometimes I just get the
> timeout and I *cannot* find when the command was issued in the trace,
> another thing that seems bizarre to me.

That really sounds like a device FW crash (the device stops responding), but...

> Note that I didn't ask for help with that intentionally, I still think
> that I am too far away from a proper diagnosis to have a fruitful
> conversation about where the timeouts originate and why. We've checked
> against power issues and the like, and again, this happens only when
> the drive sits behind a SATA controller, not when it's behind a SAS
> controller.
> 
>> Also, please send the output of "hdparm -I" for that SSD please, so that
>> we have information about what standard it is (supposedly) following.
> 
> See below, but I don't think the specific drive is relevant. The same
> "problem" shows up with a different brand/model as well, again only in
> the SATA context, not for SAS.

...since it happens with other drives, it may be something to do with the
host AHCI adapter. What are you using? Do you get the same behaviour if
you use a different host with a different AHCI adapter?

> 
> Cheers,
> Peter
> 
> -----CUT-----
> /dev/sde:
> 
> ATA device, with non-removable media
>     Model Number:       WDC  WDS400T2B0A-00SM50

I know this vendor well :)

>     Serial Number:      2113CN420743
>     Firmware Revision:  415020WD
>     Media Serial Num:
>     Media Manufacturer:
>     Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II
> Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
> Standards:
>     Used: unknown (minor revision code 0x005e)
>     Supported: 11 10 9 8 7 6 5
>     Likely used: 11
> Configuration:
>     Logical        max    current
>     cylinders    16383    0
>     heads        16    0
>     sectors/track    63    0
>     --
>     LBA    user addressable sectors:   268435455
>     LBA48  user addressable sectors:  7814037168
>     Logical  Sector size:                   512 bytes
>     Physical Sector size:                   512 bytes
>     Logical Sector-0 offset:                  0 bytes
>     device size with M = 1024*1024:     3815447 MBytes
>     device size with M = 1000*1000:     4000787 MBytes (4000 GB)
>     cache/buffer size  = unknown
>     Form Factor: 2.5 inch
>     Nominal Media Rotation Rate: Solid State Device
> Capabilities:
>     LBA, IORDY(can be disabled)
>     Queue depth: 32
>     Standby timer values: spec'd by Standard, no device specific minimum
>     R/W multiple sector transfer: Max = 1    Current = 1
>     Advanced power management level: disabled
>     DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
>          Cycle time: min=120ns recommended=120ns
>     PIO: pio0 pio1 pio2 pio3 pio4
>          Cycle time: no flow control=120ns  IORDY flow control=120ns
> Commands/features:
>     Enabled    Supported:
>        *    SMART feature set
>             Security Mode feature set
>        *    Power Management feature set
>        *    Write cache
>        *    Look-ahead
>        *    WRITE_BUFFER command
>        *    READ_BUFFER command
>        *    DOWNLOAD_MICROCODE
>             Advanced Power Management feature set
>        *    48-bit Address feature set
>        *    Mandatory FLUSH_CACHE
>        *    FLUSH_CACHE_EXT
>        *    SMART error logging
>        *    SMART self-test
>        *    General Purpose Logging feature set
>        *    64-bit World wide name
>        *    WRITE_UNCORRECTABLE_EXT command
>        *    {READ,WRITE}_DMA_EXT_GPL commands
>        *    Segmented DOWNLOAD_MICROCODE
>             unknown 119[8]
>        *    Gen1 signaling speed (1.5Gb/s)
>        *    Gen2 signaling speed (3.0Gb/s)
>        *    Gen3 signaling speed (6.0Gb/s)
>        *    Native Command Queueing (NCQ)
>        *    Phy event counters
>        *    READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
>             DMA Setup Auto-Activate optimization
>             Device-initiated interface power management
>             Asynchronous notification (eg. media change)
>        *    Software settings preservation
>             Device Sleep (DEVSLP)
>        *    SANITIZE feature set
>        *    BLOCK_ERASE_EXT command
>        *    DOWNLOAD MICROCODE DMA command
>        *    WRITE BUFFER DMA command
>        *    READ BUFFER DMA command
>        *    Data Set Management TRIM supported (limit 8 blocks)
>        *    Deterministic read ZEROs after TRIM
> Security:
>     Master password revision code = 65534
>         supported
>     not    enabled
>     not    locked
>     not    frozen
>     not    expired: security count
>         supported: enhanced erase
>     2min for SECURITY ERASE UNIT. 2min for ENHANCED SECURITY ERASE UNIT.
> Logical Unit WWN Device Identifier: 5001b444a70c2c64
>     NAA        : 5
>     IEEE OUI    : 001b44
>     Unique ID    : 4a70c2c64
> Device Sleep:
>     DEVSLP Exit Timeout (DETO): 30 ms (drive)
>     Minimum DEVSLP Assertion Time (MDAT): 30 ms (drive)
> Checksum: correct
> -----CUT-----

-- 
Damien Le Moal
Western Digital Research




