ATA link drop-outs

Hello. I have been chasing an issue with ATA link drop-outs and wanted
to run this by some subject-matter experts.
System Information (affected)
Distribution: AlmaLinux 8.8
Kernel Version: 4.18.0-477.13.1
Architecture: x64
OpenZFS Version: 2.1.5-1

The drop-outs occur with SSDs that are attached to Marvell 88SE9235
SATA controllers via Marvell 88SM9705 port multipliers. The SSDs are
M.2 form factor and are typically WD or SanDisk models. When the issue
occurs, communication with all five SSDs connected to the port
multiplier is lost, and the driver performs recovery steps to
re-establish the connection with them. This results in ZFS I/O errors
being reported by zpool status. Multiple events in which the driver's
recovery steps are unsuccessful can lead to pool suspension.

The issue occurs with both small and large I/O workloads, though it
usually takes longer to manifest with a small I/O workload.

The issue DOES NOT occur with an older version of CentOS and ZFS
running on the same hardware.
System Information (not affected)
Distribution: CentOS 7.9
Kernel Version: 3.10.0-1160.15.2
Architecture: x64
OpenZFS Version: 0.8.6-1

I have tried the following, in different combinations, but the issue
still occurs:
Disabling NCQ
Lowering SATA speed to 3.0 Gbps
Upgrading ZFS to 2.1.13
Upgrading to AlmaLinux 8.9
Changing SATA power management from max_performance -> medium_power
Changing I/O scheduler from none -> mq-deadline
Changing max_sectors_kb -> 512
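
For anyone comparing the working and failing installs, here is a rough Python sketch that collects the tunables listed above into one dictionary so the two systems can be diffed. The sysfs paths are the ones I would expect libata and the block layer to expose on these kernels, but treat them as assumptions and check them on your system. The function name snapshot and its sysfs_root parameter are made up for this example. A queue_depth of 1 is used here only as a hint that NCQ is effectively off.

```python
from pathlib import Path


def read_text(path):
    """Return the stripped contents of a sysfs attribute, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def snapshot(sysfs_root="/sys"):
    """Collect SATA/block tunables for every sd* disk, SCSI host and ATA link.

    sysfs_root is a hypothetical parameter that lets the function be pointed
    at a copy of /sys (or a test directory) instead of the live system.
    """
    root = Path(sysfs_root)
    report = {}

    # Per-disk settings: I/O scheduler, max_sectors_kb, and NCQ queue depth.
    for dev in sorted((root / "block").glob("sd*")):
        report[dev.name] = {
            "scheduler": read_text(dev / "queue" / "scheduler"),
            "max_sectors_kb": read_text(dev / "queue" / "max_sectors_kb"),
            "queue_depth": read_text(dev / "device" / "queue_depth"),
        }

    # Per-host SATA link power management policy (assumed attribute name).
    for host in sorted((root / "class" / "scsi_host").glob("host*")):
        report[host.name] = {
            "link_power_management_policy": read_text(
                host / "link_power_management_policy"
            ),
        }

    # Negotiated link speed per ATA link, including links behind a port multiplier.
    for link in sorted((root / "class" / "ata_link").glob("link*")):
        report[link.name] = {"sata_spd": read_text(link / "sata_spd")}

    return report
```

Running json.dumps(snapshot(), indent=2) on both installs and diffing the output should show whether a default differs between them. Note that the older kernel may list different scheduler names than the newer one, so compare the meaning and not just the string.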

The issue can be reproduced as follows:
Small I/O workload: Boot the system with apps that generate a small,
sustained I/O load on the ZFS pool and let it run without interaction
Large I/O workload: Use fio to generate a heavy I/O workload on the ZFS pool

Partial syslog snippet showing the initial messages when a drop-out occurs:
Dec 17 07:41:00.384 test01 kernel: ata7.00: failed to read SCR 1 (Emask=0x40)
Dec 17 07:41:00.384 test01 kernel: ata7.01: failed to read SCR 1 (Emask=0x40)
Dec 17 07:41:00.384 test01 kernel: ata7.02: failed to read SCR 1 (Emask=0x40)
Dec 17 07:41:00.384 test01 kernel: ata7.03: failed to read SCR 1 (Emask=0x40)
Dec 17 07:41:00.384 test01 kernel: ata7.04: failed to read SCR 1 (Emask=0x40)
Dec 17 07:41:00.384 test01 kernel: ata7.00: exception Emask 0x100 SAct 0x4200000 SErr 0x0 action 0x6 frozen
Dec 17 07:41:00.384 test01 kernel: ata7.00: failed command: WRITE FPDMA QUEUED
Dec 17 07:41:00.384 test01 kernel: ata7.00: cmd 61/0b:a8:da:66:d1/00:00:08:00:00/40 tag 21 ncq dma 5632 out
         res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 17 07:41:00.384 test01 kernel: ata7.00: status: { DRDY }
Dec 17 07:41:00.384 test01 kernel: ata7.00: failed command: WRITE FPDMA QUEUED
Dec 17 07:41:00.384 test01 kernel: ata7.00: cmd 61/15:d0:28:26:fe/00:00:06:00:00/40 tag 26 ncq dma 10752 out
         res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 17 07:41:00.384 test01 kernel: ata7.00: status: { DRDY }
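
To make the messages above easier to read, here is a small Python sketch that decodes the numeric fields. The bit tables follow my reading of the AC_ERR_* and ATA_EH_* definitions in the kernel's include/linux/libata.h and the taskfile layout printed by the libata error handler. Treat them as assumptions and verify against the source of the kernel you run. The helper names (decode_bits, sact_tags, decode_ncq_cmd) are made up for this example, and a 512-byte logical sector size is assumed.

```python
# Error-mask bits (AC_ERR_*), assumed from include/linux/libata.h.
AC_ERR = {
    0x001: "device error",
    0x002: "HSM violation",
    0x004: "timeout",
    0x008: "media error",
    0x010: "ATA bus error",
    0x020: "host bus error",
    0x040: "internal error",
    0x080: "invalid argument",
    0x100: "unknown error",
    0x200: "no-device hint",
    0x400: "NCQ error",
}

# Error-handler action bits (ATA_EH_*), same assumption as above.
EH_ACTIONS = {0x1: "revalidate", 0x2: "softreset", 0x4: "hardreset"}

NCQ_OPCODES = {0x60: "READ FPDMA QUEUED", 0x61: "WRITE FPDMA QUEUED"}


def decode_bits(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for bit, name in sorted(table.items()) if value & bit]


def sact_tags(sact):
    """SAct is a bitmap of outstanding NCQ tags; return the set tag numbers."""
    return [tag for tag in range(32) if (sact >> tag) & 1]


def decode_ncq_cmd(taskfile, sector_size=512):
    """Decode the 'cmd' taskfile string of an NCQ read/write.

    Assumed layout: cmd/feat:nsect:lbal:lbam:lbah/hob_feat:hob_nsect:
    hob_lbal:hob_lbam:hob_lbah/device. For NCQ commands the sector count
    lives in the feature fields and the tag in bits 7:3 of nsect.
    """
    groups = [[int(b, 16) for b in g.split(":")] for g in taskfile.split("/")]
    command = groups[0][0]
    if command not in NCQ_OPCODES:
        raise ValueError("not an NCQ read/write command: 0x%02x" % command)
    feature, nsect, lbal, lbam, lbah = groups[1]
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = groups[2]
    sectors = (hob_feature << 8) | feature
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    return {
        "op": NCQ_OPCODES[command],
        "tag": nsect >> 3,
        "sectors": sectors,
        "bytes": sectors * sector_size,
        "lba": lba,
    }
```

Applied to the snippet: sact_tags(0x4200000) gives tags 21 and 26, which match the two failed WRITE FPDMA QUEUED commands, and the decoded sizes (11 and 21 sectors) match the 5632 and 10752 bytes in the log. Under the same assumptions, Emask 0x40 on the SCR reads is an internal error, Emask 0x4 is the timeout, and action 0x6 means a soft plus hard reset was requested.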

Any input on this would be greatly appreciated!



