Re: [PATCH v2] scsi_sysfs: fix hang when removing scsi device

Israel Rukshin <israelr@xxxxxxxxxxxx> · Thu, 16 Mar 2017 11:02:02 +0200

Hi Bart,

I tested your patches and the hang disappeared when fast_io_fail_tmo 
expired.
The warning from patch 1 still exist, so we need an additional fix for that.

Regards,
Israel

On 3/16/2017 1:27 AM, Bart Van Assche wrote:
On Tue, 2017-03-14 at 16:23 +0200, Israel Rukshin wrote:
Patch number 5 doesn't handle the case when device_for_each_child() is
called. device_for_each_child() calls to target_unblock() that uses also
starget_for_each_device(). After applying also the following change the
hang disappeared but it didn't fix the warning. Those fixes are not enough
because if fast_io_fail_tmo is set to infinity then the hang will remain,
because only __rport_fail_io_fast() calls to scsi_target_unblock() and
terminate_rport_io() that free the sync cache command.

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index e5d4b50..09f9566 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -3068,9 +3068,15 @@ void scsi_device_resume(struct scsi_device *sdev)
   static int
   target_unblock(struct device *dev, void *data)
   {
-       if (scsi_is_target_device(dev))
-               starget_for_each_device(to_scsi_target(dev), data,
-                                       device_unblock);
+       if (scsi_is_target_device(dev)) {
+               struct scsi_target *starget = to_scsi_target(dev);
+               struct Scsi_Host *shost = dev_to_shost(dev->parent);
+               unsigned long flags;
+
+               spin_lock_irqsave(shost->host_lock, flags);
+               __starget_for_each_device(starget, data, device_unblock);
+               spin_unlock_irqrestore(shost->host_lock, flags);
+       }
          return 0;
   }
Hello Israel,

Regarding setting fast_io_fail_tmo to infinity: that does not prevent
kernel module unloading because srp_timed_out() stops resetting the
timer as soon as the SCSI device is unblocked. The above patch should
realize that but suffers from the same issue as a patch attached to
my previous e-mail, namely lock inversion. For at least the following
call chain the block layer queue lock is the outer lock and the SCSI
host lock is the inner lock:
ata_qc_schedule_eh()
-> blk_abort_request()
   -> blk_mq_rq_timed_out()
     -> scsi_timeout()
       -> scsi_times_out()
         -> scsi_eh_scmd_add()

So I think we should avoid introducing code with the SCSI host lock
as outer lock and the block layer queue lock as inner lock. How about
the attached four patches?

Thanks,

Bart.