On 8/8/22 3:20 PM, mwilck@xxxxxxxx wrote:
> From: Martin Wilck <mwilck@xxxxxxxx>
>
> The SCSI mid layer doesn't retry commands after DID_TIME_OUT (see
> scsi_noretry_cmd()). Packet loss in the fabric can cause spurious timeouts
> during SCSI device probing, causing device probing to fail. This has been
> observed in FCoE uplink failover tests, for example.

What about the other scan/probe related commands and other transient
transport errors like this (so when we get to the point
DID_TRANSPORT_DISRUPTED is returned)? I think if you changed your test a
little so the fc port state changed, we could still hit the same end
problem. We can hit similar errors with iscsi and plain old FC.

For REPORT_LUNS it looks like we retry almost all errors 3 times. For the
probe/setup commands, at least for disks, it looks like we also are more
forgiving and will retry DID_TIME_OUT/DID_TRANSPORT_DISRUPTED 3 times for
commands like SAI_READ_CAPACITY_16 (I didn't check every sd operation and
other upper level drivers). However, for the other probe/setup operations
that rely on scsi_attach_vpd succeeding, like sd_read_block_limits, we
will hit issues where the device is only partially set up. Should
scsi_vpd_inquiry be retrying 3 times as well?

An alternative to changing all the callers would be to make
scsi_noretry_cmd detect when it's an internal passthrough command and
just retry these types of errors. For SG IO type of passthrough we still
want to fail right away.

>
> This patch fixes the issue by retrying the INQUIRY up to 3 times (in practice,
> we never observed more than a single retry),
>
> Signed-off-by: Martin Wilck <mwilck@xxxxxxxx>
> Tested-by: Dave Prizer <dave.prizer@xxxxxxx>
>
> ---
> This patch was previously part of the series "Fixes for device probing
> on flaky connections", submitted on 2022/06/15. The first patch of the
> series has been dropped as discussed in the review process. Testing
> verified that just this patch was sufficient to solve the observed
> issues.
>
> ---
>  drivers/scsi/scsi_scan.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
> index 91ac901a66826..e859a648033f9 100644
> --- a/drivers/scsi/scsi_scan.c
> +++ b/drivers/scsi/scsi_scan.c
> @@ -697,6 +697,11 @@ static int scsi_probe_lun(struct scsi_device *sdev, unsigned char *inq_result,
>  				    (sshdr.ascq == 0))
>  					continue;
>  			}
> +			if (host_byte(result) == DID_TIME_OUT) {
> +				SCSI_LOG_SCAN_BUS(3, sdev_printk(KERN_INFO, sdev,
> +						"scsi scan: retry inquiry after timeout\n"));
> +				continue;
> +			}
>  		} else if (result == 0) {
>  			/*
>  			 * if nothing was transferred, we try

Should there