Re: failed to set xfermode (err_mask=0x40): READ LOG DMA EXT failed, trying PIO

Paul Menzel <pmenzel@xxxxxxxxxxxxx> · Sun, 5 May 2019 10:41:09 +0200

Dear Linux folks,

On 01.05.19 11:34, Paul Menzel wrote:

On an MSI B350M MORTAR (MS-7A37) with all firmwares, and the AHCI SSD 
Crucial MX500, *sometimes* boot and resume are delayed by four to five 
seconds, and the log contains the message *failed to set xfermode*.

[    0.474150] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 0)
[    0.474267] microcode: Microcode Update Driver: v2.2.
[    0.474294] AVX2 version of gcm_enc/dec engaged.
[    0.474295] AES CTR mode by8 optimization enabled
[    0.474304] ata9.00: supports DRM functions and may not be fully accessible
[    0.474368] ata9.00: ATA-10: CT1000MX500SSD4, M3CR020, max UDMA/133
[    0.474370] ata9.00: 1953525168 sectors, multi 1: LBA48 NCQ (depth 32), AA
[    0.474569] ata9.00: READ LOG DMA EXT failed, trying PIO
[    0.474571] ata9.00: failed to get Identify Device Data, Emask 0x40
[    0.474572] ata9.00: ATA Identify Device Log not supported
[    0.474573] ata9.00: Security Log not supported
[    0.474576] ata9.00: failed to set xfermode (err_mask=0x40)
[    0.493193] sched_clock: Marking stable (504159603, -10991650)->(615393192, -122225239)
[    0.493768] registered taskstats version 1
[    0.493770] Loading compiled-in X.509 certificates
[    0.497894] Loaded X.509 cert 'Build time autogenerated kernel key: 4682756457c0b7b167f0b726943cf53ade93bd78'
[    0.497984] zswap: loaded using pool lzo/zbud
[    0.498237] kmemleak: Kernel memory leak detector initialized
[    0.498243] kmemleak: Automatic memory scanning thread started
[    0.506469] Key type big_key registered
[    0.508864]   Magic number: 11:580:189
[    0.508914] acpi device:48: hash matches
[    0.508922] acpi device:1b: hash matches
[    0.509017] rtc_cmos 00:02: setting system clock to 2019-03-29T17:09:41 UTC (1553879381)
[    0.509485] After kernel_init_freeable
[    1.364358] tsc: Refined TSC clocksource calibration: 3499.982 MHz
[    1.364370] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x32734013f67, max_idle_ns: 440795323134 ns
[    1.364419] clocksource: Switched to clocksource tsc
[    5.593141] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 0)
[    5.593270] ata9.00: supports DRM functions and may not be fully accessible
[    5.594006] ata9.00: supports DRM functions and may not be fully accessible
[    5.594600] ata9.00: configured for UDMA/133
[    5.595893] scsi 8:0:0:0: Direct-Access     ATA      CT1000MX500SSD4  020  PQ: 0 ANSI: 5
[    5.596611] sd 8:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
[    5.596613] sd 8:0:0:0: [sda] 4096-byte physical blocks
[    5.596626] sd 8:0:0:0: [sda] Write Protect is off
[    5.596629] sd 8:0:0:0: [sda] Mode Sense: 00 3a 00 00
[    5.596650] sd 8:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    5.596837] sd 8:0:0:0: Attached scsi generic sg0 type 0
[    5.597329]  sda: sda1 sda2 sda3
[    5.599079] sd 8:0:0:0: [sda] Attached SCSI disk

This happens with all Linux kernel versions – tested since 4.14. 
Searching for this error on the Web, several people seem to be affected 
with different boards and drives [1].

In a lot of those reports, I also see the message *Security Log not 
supported*.

Do you have an idea, why this is only happening sometimes? Is it likely 
a drive firmware issue? As it affects several vendors, can Linux work 
around this problem (if it can be pinpointed)?

The first error seems to be `READ LOG DMA EXT failed, trying PIO`. The 
code looks like below [2].

retry:
        ata_tf_init(dev, &tf);
        if (dev->dma_mode && ata_id_has_read_log_dma_ext(dev->id) &&
            !(dev->horkage & ATA_HORKAGE_NO_DMA_LOG)) {
                tf.command = ATA_CMD_READ_LOG_DMA_EXT;
                tf.protocol = ATA_PROT_DMA;
                dma = true;
        } else {
                tf.command = ATA_CMD_READ_LOG_EXT;
                tf.protocol = ATA_PROT_PIO;
                dma = false;
        }
        tf.lbal = log;
        tf.lbam = page;
        tf.nsect = sectors;
        tf.hob_nsect = sectors >> 8;
        tf.flags |= ATA_TFLAG_ISADDR | ATA_TFLAG_LBA48 | ATA_TFLAG_DEVICE;

        err_mask = ata_exec_internal(dev, &tf, NULL, DMA_FROM_DEVICE,
                                     buf, sectors * ATA_SECT_SIZE, 0);

        if (err_mask && dma) {
                dev->horkage |= ATA_HORKAGE_NO_DMA_LOG;
                ata_dev_warn(dev, "READ LOG DMA EXT failed, trying PIO\n");
                goto retry;
        }

        DPRINTK("EXIT, err_mask=%x\n", err_mask);
        return err_mask;

So when trying PIO, that also seems to fail, and the error mask 0x40 is 
propagated up.

enum ata_completion_errors {
        AC_ERR_DEV              = (1 << 0), /* device reported error */
        AC_ERR_HSM              = (1 << 1), /* host state machine violation */
        AC_ERR_TIMEOUT          = (1 << 2), /* timeout */
        AC_ERR_MEDIA            = (1 << 3), /* media error */
        AC_ERR_ATA_BUS          = (1 << 4), /* ATA bus error */
        AC_ERR_HOST_BUS         = (1 << 5), /* host bus error */
        AC_ERR_SYSTEM           = (1 << 6), /* system error */
        AC_ERR_INVALID          = (1 << 7), /* invalid argument */
        AC_ERR_OTHER            = (1 << 8), /* unknown */
        AC_ERR_NODEV_HINT       = (1 << 9), /* polling device detection hint */
        AC_ERR_NCQ              = (1 << 10), /* marker for offending NCQ qc */
};

So 0x40 is AC_ERR_HOST_BUS, right? Any idea, what might cause this?

Would it be helpful to enable the debug messages?

    #undef ATA_DEBUG                /* debugging output */

Kind regards,

Paul

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=195895[2]: 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/ata/libata-core.c?id=1daa0449d287a109b93c4516914eddeff4baff65#n2075
[3]: 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/libata.h?id=1daa0449d287a109b93c4516914eddeff4baff65#n508