On Fri, 2018-03-16 at 15:00 -0700, James Bottomley wrote:
> On Fri, 2018-03-16 at 10:40 -0700, Bart Van Assche wrote:
> > @@ -1050,7 +1050,22 @@ static int scsi_send_eh_cmnd(struct scsi_cmnd *scmd, unsigned char *cmnd,
> >
> >  	scsi_log_send(scmd);
> >  	scmd->scsi_done = scsi_eh_done;
> > -	rtn = shost->hostt->queuecommand(shost, scmd);
> > +	mutex_lock(&sdev->state_mutex);
> > +	while (sdev->sdev_state == SDEV_QUIESCE && timeleft > 0) {
> > +		mutex_unlock(&sdev->state_mutex);
> > +		SCSI_LOG_ERROR_RECOVERY(5, sdev_printk(KERN_DEBUG, sdev,
> > +			"%s: state %d <> %d\n", __func__, sdev->sdev_state,
> > +			SDEV_QUIESCE));
> > +		delay = min(timeleft, stall_for);
> > +		timeleft -= delay;
> > +		msleep(jiffies_to_msecs(delay));
> > +		mutex_lock(&sdev->state_mutex);
> > +	}
>
> What's the point of this loop? if you eliminate it, you still get
> exactly the same msleep from the stall_for retry processing.

Hello James,

The purpose of that loop is to check the SCSI device state every "stall_for"
jiffies and to avoid that more than "timeleft" jiffies is spent on waiting.

> Plus I really don't think you want to call ->queuecommand() with the
> state mutex held.

Since .queuecommand() is not allowed to sleep I think that holding the SCSI
device state mutex does not introduce the possibility of a new deadlock. What
makes you think that calling ->queuecommand() with that mutex held wouldn't be
safe?

> You don't even need to hold the state mutex to read sdev->state because
> the read is atomic and the mutex doesn't mediate anything. The check to
> queuecommand race is the same for every consumer.

Since both scsi_device_quiesce() and scsi_device_resume() acquire and release
the SCSI device .state_mutex, obtaining that mutex realizes serialization of
the above code against scsi_device_quiesce() and scsi_device_resume().
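
To make the intended behavior of the loop concrete, here is a minimal user-space
sketch of the same pattern, using pthreads instead of kernel primitives. It is
not the kernel code: struct fake_dev, wait_while_quiesced() and sleep_ms() are
names made up for this illustration, and milliseconds stand in for jiffies. The
helper polls the state every "stall" interval, never waits longer than the
remaining budget, and returns with the mutex held so that the caller can
submit the command before a quiesce can start.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <time.h>

/* Hypothetical stand-ins for SDEV_RUNNING / SDEV_QUIESCE. */
enum dev_state { DEV_RUNNING, DEV_QUIESCE };

/* Hypothetical stand-in for the relevant part of struct scsi_device. */
struct fake_dev {
	pthread_mutex_t state_mutex;
	enum dev_state state;
};

/* Sleep for the given number of milliseconds. */
static void sleep_ms(long ms)
{
	struct timespec ts = {
		.tv_sec = ms / 1000,
		.tv_nsec = (ms % 1000) * 1000000L,
	};

	nanosleep(&ts, NULL);
}

/*
 * Wait until the device leaves the quiesced state or the time budget is
 * used up, whichever comes first. The mutex is dropped while sleeping and
 * is held again on return, so a state change cannot slip in between the
 * final state check and whatever the caller does next.
 *
 * Returns the part of the budget (in ms) that was not spent waiting.
 */
static long wait_while_quiesced(struct fake_dev *d, long timeleft_ms,
				long stall_ms)
{
	pthread_mutex_lock(&d->state_mutex);
	while (d->state == DEV_QUIESCE && timeleft_ms > 0) {
		long delay = timeleft_ms < stall_ms ? timeleft_ms : stall_ms;

		pthread_mutex_unlock(&d->state_mutex);
		timeleft_ms -= delay;
		sleep_ms(delay);
		pthread_mutex_lock(&d->state_mutex);
	}
	return timeleft_ms;
}
```

A caller would invoke wait_while_quiesced(), submit its command while still
holding state_mutex, and only then unlock the mutex. Any code that changes the
state has to take the same mutex for this to provide serialization.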

If the above code did not obtain .state_mutex then the following race would be
possible:
* scsi_send_eh_cmnd() sees that .sdev_state != SDEV_QUIESCE and decides that
  it is safe to call .queuecommand().
* Another thread calls scsi_device_quiesce() and removes some of the resources
  needed by .queuecommand().
* scsi_send_eh_cmnd() calls .queuecommand(), resulting in a kernel oops.

Please let me know if you need more information.

Thanks,

Bart.