Hi,
I am currently facing an issue with a Broadcom HBA 9500-8i SAS
controller where 'blkdiscard /dev/sdX' on WD SA500 SATA SSDs causes an
I/O timeout and a device reset.
* LSI/Broadcom HBA 9500-8i SAS/SATA controller
* WD RED SA500 NAS SATA SSD 2TB (WDS200T1R0A-68A4W0)
  Drive FW: 411000WR
* Alpine Linux kernel 5.15.48
* /sys/block/sdf/queue/
  discard_granularity:512
  discard_max_bytes:134217216
  discard_max_hw_bytes:134217216
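
For reference, the discard limit above is 128 MiB minus one 512-byte
sector. The sketch below works through that arithmetic and estimates how
many discard requests a whole-device blkdiscard would be split into. It
assumes a nominal capacity of 2 * 10**12 bytes (the exact capacity is
not listed here) and that each request is capped at discard_max_bytes;
both are assumptions, not measurements.

```python
SECTOR = 512
# Value read from /sys/block/sdf/queue/discard_max_bytes (listed above).
DISCARD_MAX_BYTES = 134217216
# Assumption: nominal "2TB" capacity; the real device size may differ.
NOMINAL_CAPACITY = 2 * 10**12


def discard_request_count(capacity_bytes, max_bytes):
    """Ceiling division: requests needed to cover the whole device."""
    return -(-capacity_bytes // max_bytes)


# Sectors covered by one maximum-size discard request.
sectors_per_request = DISCARD_MAX_BYTES // SECTOR
# How far the limit falls short of a round 128 MiB.
shortfall = 128 * 1024**2 - DISCARD_MAX_BYTES
# Estimated number of requests for a full-device discard.
requests = discard_request_count(NOMINAL_CAPACITY, DISCARD_MAX_BYTES)

if __name__ == "__main__":
    print(sectors_per_request, shortfall, requests)
```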
I simply issue 'blkdiscard /dev/sdf', and after about 30 seconds the
following errors show up in dmesg (many lines of them). The full
blkdiscard takes between 1.5 and 2.5 minutes, depending on which SSD I
run it on (I have 4 drives). The problem is the same if I run fstrim on
a mounted XFS or Btrfs (but not ext4) filesystem on these drives.
[ +0.000003] scsi target6:0:4: handle(0x0029), sas_address(0x5003048020db4543), phy(3)
[ +0.000003] scsi target6:0:4: enclosure logical id(0x5003048020db457f), slot(3)
[ +0.000003] scsi target6:0:4: enclosure level(0x0000), connector name( C0.1)
[ +0.000003] sd 6:0:4:0: No reference found at driver, assuming scmd(0x00000000eb0d9438) might have completed
[ +0.000003] sd 6:0:4:0: task abort: SUCCESS scmd(0x00000000eb0d9438)
[ +0.000012] sd 6:0:4:0: attempting task abort!scmd(0x0000000075f63919), outstanding for 30397 ms & timeout 30000 ms
[ +0.000003] sd 6:0:4:0: [sdg] tag#2762 CDB: opcode=0x42 42 00 00 00 00 00 00 00 18 00
[ +0.000002] scsi target6:0:4: handle(0x0029), sas_address(0x5003048020db4543), phy(3)
[ +0.000004] scsi target6:0:4: enclosure logical id(0x5003048020db457f), slot(3)
[ +0.000002] scsi target6:0:4: enclosure level(0x0000), connector name( C0.1)
[ +0.000003] sd 6:0:4:0: No reference found at driver, assuming scmd(0x0000000075f63919) might have completed
[ +0.000003] sd 6:0:4:0: task abort: SUCCESS scmd(0x0000000075f63919)
[ +0.255021] sd 6:0:4:0: Power-on or device reset occurred
Does the mpt3sas driver or the HBA controller not honor the
/sys/block/<dev>/device/timeout value? I have mine set to 180 seconds,
yet the abort messages above report a timeout of 30000 ms. There also
seem to be many hardcoded timeout values in the driver code:
https://github.com/torvalds/linux/blob/master/drivers/scsi/mpt3sas/mpt3sas_scsih.c
https://github.com/torvalds/linux/blob/6a0a17e6c6d1091ada18d43afd87fb26a82a9823/drivers/scsi/mpt3sas/mpt3sas_scsih.c#L3303-L3306
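
To make the mismatch concrete, here is a minimal sketch that pulls the
two numbers out of a "task abort" line and reads the configured timeout
from sysfs for comparison. The function names are my own, the regular
expression only matches the message format shown in the log above, and
the sysfs file is assumed to hold the timeout in whole seconds.

```python
import re
from pathlib import Path

# Matches the tail of the mpt3sas "attempting task abort" message above.
ABORT_RE = re.compile(r"outstanding for (\d+) ms & timeout (\d+) ms")


def parse_abort_line(line):
    """Return (outstanding_ms, timeout_ms) from an abort line, or None."""
    match = ABORT_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def configured_timeout_ms(dev):
    """Read /sys/block/<dev>/device/timeout (seconds) and return ms."""
    text = Path(f"/sys/block/{dev}/device/timeout").read_text()
    return int(text.strip()) * 1000


# Abort line copied from the dmesg output above.
SAMPLE = ("sd 6:0:4:0: attempting task abort!scmd(0x0000000075f63919), "
          "outstanding for 30397 ms & timeout 30000 ms")

if __name__ == "__main__":
    print(parse_abort_line(SAMPLE))
```

Running configured_timeout_ms() for each of the four drives and comparing
the result with the timeout in the abort message would show whether the
180-second setting was actually in effect on the device that logged the
abort.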
Any thoughts, other than avoiding discards/fstrim altogether? I did
reach out to Broadcom for support, and they claim it is a fault in the
fstrim code and that they had fixed this on FreeBSD. I am not sure how
relevant that statement is, though.
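
One experiment that might narrow this down is discarding the device in
smaller ranges, to see whether shorter requests complete within the
timeout. The following is only a sketch of that idea, not a known fix:
the helper names and the chunk size are hypothetical, it relies on the
blkdiscard -o (offset) and -l (length) options, and it assumes
/sys/block/<dev>/size reports the size in 512-byte sectors. It destroys
all data on the device, just like a plain blkdiscard.

```python
import subprocess
from pathlib import Path


def device_size_bytes(dev):
    """Device size from /sys/block/<dev>/size (512-byte sectors)."""
    return int(Path(f"/sys/block/{dev}/size").read_text().strip()) * 512


def discard_ranges(device_bytes, chunk_bytes):
    """Yield (offset, length) pairs covering the device chunk by chunk."""
    offset = 0
    while offset < device_bytes:
        length = min(chunk_bytes, device_bytes - offset)
        yield offset, length
        offset += length


def discard_in_chunks(dev, chunk_bytes):
    """Run blkdiscard once per chunk. DESTROYS ALL DATA on /dev/<dev>."""
    for offset, length in discard_ranges(device_size_bytes(dev), chunk_bytes):
        subprocess.run(
            ["blkdiscard", "-o", str(offset), "-l", str(length), f"/dev/{dev}"],
            check=True,  # stop at the first failing range
        )


# Hypothetical usage (chunk size chosen arbitrarily, 1 GiB):
# discard_in_chunks("sdf", 1024**3)
```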
Thanks,
Forza