The bug reproduces when unloading the srp module while one port is down.
sd_shutdown() hangs when __scsi_remove_device() gets a scsi_device whose
state is SDEV_OFFLINE or SDEV_TRANSPORT_OFFLINE. It hangs because
sd_shutdown() tries to send a SYNCHRONIZE CACHE command while the device
is offline but its state is already SDEV_CANCEL. The state was changed
to SDEV_CANCEL by __scsi_remove_device() before it called device_del().
The block layer timeout mechanism doesn't fail the SYNCHRONIZE CACHE
command when its timeout expires, because the request timer was never
started: blk_peek_request(), which is called from scsi_request_fn(),
doesn't return this request, so its timer never starts. This commit
fails new requests if the original device state was offline, by marking
the request queue as dying.

The bug was revealed by commit cff549 ("scsi: proper state checking and
module refcount handling in scsi_device_get"). Since that commit,
scsi_device_get() returns an error if the device state is SDEV_CANCEL.
This eventually prevents the SRP fast I/O failure timeout handler from
cleaning up the sync cache command, because scsi_target_unblock() skips
the canceled device. If this timeout handler is set to infinity, the
hang remains forever, even before commit cff549.

sysrq: SysRq :
sysrq: Show Blocked State
  task                        PC stack   pid father
kworker/2:0     D ffff88046fa95c00     0 21178      2 0x00000000
Workqueue: srp_remove srp_remove_work [ib_srp]
Call Trace:
 [<ffffffff815dd985>] schedule+0x35/0x80
 [<ffffffff815e02c7>] schedule_timeout+0x237/0x2d0
 [<ffffffff815dcf46>] io_schedule_timeout+0xa6/0x110
 [<ffffffff815de2f3>] wait_for_completion_io+0xa3/0x110
 [<ffffffff812e66ff>] blk_execute_rq+0xdf/0x120
 [<ffffffffa00135de>] scsi_execute+0xce/0x150 [scsi_mod]
 [<ffffffffa001548f>] scsi_execute_req_flags+0x8f/0xf0 [scsi_mod]
 [<ffffffffa0154849>] sd_sync_cache+0xa9/0x190 [sd_mod]
 [<ffffffffa0154c3a>] sd_shutdown+0x6a/0x100 [sd_mod]
 [<ffffffffa0154d34>] sd_remove+0x64/0xc0 [sd_mod]
 [<ffffffff8144d3fd>] __device_release_driver+0x8d/0x120
 [<ffffffff8144d4ae>] device_release_driver+0x1e/0x30
 [<ffffffff8144c239>] bus_remove_device+0xf9/0x170
 [<ffffffff81448bc7>] device_del+0x127/0x240
 [<ffffffffa001c0f1>] __scsi_remove_device+0xc1/0xd0 [scsi_mod]
 [<ffffffffa001a5d7>] scsi_forget_host+0x57/0x60 [scsi_mod]
 [<ffffffffa000e3d2>] scsi_remove_host+0x72/0x110 [scsi_mod]
 [<ffffffffa06f95ab>] srp_remove_work+0x8b/0x200 [ib_srp]
 ...

Signed-off-by: Israel Rukshin <israelr@xxxxxxxxxxxx>
Reviewed-by: Max Gurtovoy <maxg@xxxxxxxxxxxx>
---
Changes from v1:
- add extra description to the commit message and to the comment.
- refer to the commit that originally introduced this hang.
---
 drivers/scsi/scsi_sysfs.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 82dfe07..8a977f5 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -1282,6 +1282,8 @@ void __scsi_remove_device(struct scsi_device *sdev)
 		return;
 
 	if (sdev->is_visible) {
+		enum scsi_device_state oldstate = sdev->sdev_state;
+
 		if (scsi_device_set_state(sdev, SDEV_CANCEL) != 0)
 			return;
 
@@ -1289,6 +1291,17 @@
 		device_unregister(&sdev->sdev_dev);
 		transport_remove_device(dev);
 		scsi_dh_remove_device(sdev);
+
+		/*
+		 * Fail new requests if the old state was offline.
+		 * This avoids a hang in sd_shutdown().
+		 * The SYNCHRONIZE CACHE request timer will never start
+		 * in that case.
+		 */
+		if (oldstate == SDEV_TRANSPORT_OFFLINE ||
+		    oldstate == SDEV_OFFLINE)
+			blk_set_queue_dying(sdev->request_queue);
+
 		device_del(dev);
 	} else
 		put_device(&sdev->sdev_dev);
-- 
2.4.3