On Mon, 2018-06-11 at 14:59 -0700, Ted Cabeen wrote:
> On 06/11/2018 02:40 PM, James Bottomley wrote:
> > On Mon, 2018-06-11 at 12:20 -0400, Douglas Gilbert wrote:
> > > I have also seen Aborted Command sense when doing heavy testing
> > > on one or more SAS disks behind a SAS expander. I put it down to
> > > a temporary lack of paths available (on the link between the
> > > host's HBA and the expander) when one of those SAS disks tries to
> > > get a connection back to the host with the data (data-in
> > > transfer) from an earlier READ command.
> > >
> > > In my code (ddpt and sg_dd) I treat it as a "retry" type error
> > > and in my experience that works. IOW a follow-up READ with the
> > > same parameters is successful.
> >
> > We do treat ABORTED_COMMAND as a retry. However, it will tick down
> > the retry count (usually 3) and then fail if it still occurs. How
> > long does this condition persist for? because if it's long lived we
> > could treat it as ADD_TO_MLQUEUE which would mean we'd retry until
> > the timeout condition was reached.
>
> On my system, it's a bit hard to tell, as as soon as ZFS sees the
> read error, it starts resilvering to repair the sector that reported
> the I/O error. Without the scrub, it happened once over a 5-day
> window. During the scrub, it was usually 10s of minutes between
> occurrences that failed all the retries, but I had some occasions
> where it happened about 5-10 minutes apart. It definitely seems to
> be load-related, so how long and hard the load stays elevated is a
> factor.

OK, try this: it will print a rate limited warning if it triggers
(showing it is this problem) and return ADD_TO_MLQUEUE for all the SAS
errors (we'll likely narrow this if it works, but for now let's do the
lot).

James

---

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 8932ae81a15a..94aa5cb94064 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -531,6 +531,11 @@ int scsi_check_sense(struct scsi_cmnd *scmd)
 		if (sshdr.asc == 0xc1 && sshdr.ascq == 0x01 &&
 		    sdev->sdev_bflags & BLIST_RETRY_ASC_C1)
 			return ADD_TO_MLQUEUE;
 
+		if (sshdr.asc == 0x4b) {
+			printk_ratelimited(KERN_WARNING "SAS/SATA link retry\n");
+			return ADD_TO_MLQUEUE;
+		}
+
 		return NEEDS_RETRY;
 	case NOT_READY: