When a command is added to the host's error handler command queue, there is a chance that the error handler will not be woken up. This can happen when one CPU is running scsi_eh_scmd_add() at the same time as another CPU is running scsi_device_unbusy() for a different command on the same host. Each function changes one value, and then looks at the value of a variable that the other function has just changed, but if they both see stale data, neither will actually wake up the error handler. In scsi_eh_scmd_add, host_failed is incremented, then scsi_eh_wakeup() is called, which sees that host_busy is still 2, so it doesn't actually wake up the handler. Meanwhile, in scsi_device_unbusy(), host_busy is decremented, and then it sees that host_failed is 0, so it doesn't even call scsi_eh_wakeup(). Signed-off-by: Stuart Hyaes <stuart.w.hayes@xxxxxxxxx> --- diff -pur linux-4.14/drivers/scsi/scsi_error.c linux-4.14-stu/drivers/scsi/scsi_error.c --- linux-4.14/drivers/scsi/scsi_error.c 2017-11-12 12:46:13.000000000 -0600 +++ linux-4.14-stu/drivers/scsi/scsi_error.c 2017-11-17 14:22:19.230867923 -0600 @@ -243,6 +243,10 @@ void scsi_eh_scmd_add(struct scsi_cmnd * scsi_eh_reset(scmd); list_add_tail(&scmd->eh_entry, &shost->eh_cmd_q); shost->host_failed++; + /* + * See scsi_device_unbusy() for explanation of smp_mb(). + */ + smp_mb(); scsi_eh_wakeup(shost); spin_unlock_irqrestore(shost->host_lock, flags); } diff -pur linux-4.14/drivers/scsi/scsi_lib.c linux-4.14-stu/drivers/scsi/scsi_lib.c --- linux-4.14/drivers/scsi/scsi_lib.c 2017-11-12 12:46:13.000000000 -0600 +++ linux-4.14-stu/drivers/scsi/scsi_lib.c 2017-11-17 14:22:15.814867833 -0600 @@ -325,6 +325,15 @@ void scsi_device_unbusy(struct scsi_devi unsigned long flags; atomic_dec(&shost->host_busy); + + /* This function changes host_busy and looks at host_failed, while + * scsi_eh_scmd_add() updates host_failed and looks at host_busy (in + * scsi_eh_wakeup())... if these happen simultaneously without the smp + * memory barrier, each can see the old value, such that neither will + * wake up the error handler, which can cause the host controller to + * be hung forever. + */ + smp_mb(); if (starget->can_queue > 0) atomic_dec(&starget->target_busy); --- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus