The bug reproduces when unloading the srp module while one port is down.
sd_shutdown() hangs when __scsi_remove_device() gets a scsi_device whose
state is SDEV_OFFLINE or SDEV_TRANSPORT_OFFLINE. It hangs because
sd_shutdown() tries to send a SYNCHRONIZE CACHE command while the device
is offline but its state is already SDEV_CANCEL. The state was changed
to SDEV_CANCEL by __scsi_remove_device() before it called device_del().
The block layer timeout mechanism doesn't fail the SYNCHRONIZE CACHE
command when its timeout expires, because the request timer was never
started: blk_peek_request(), which is called from scsi_request_fn(),
doesn't return this request, so its timer never starts. This commit
fails new requests if the original device state was offline, by marking
the request queue as dying.

The bug was revealed by commit cff549 ("scsi: proper state checking and
module refcount handling in scsi_device_get"). Since that commit,
scsi_device_get() returns an error if the device state is SDEV_CANCEL.
This eventually prevents the SRP fast I/O failure timeout handler from
cleaning up the sync cache command, because scsi_target_unblock() skips
the canceled device. If this timeout handler is set to infinity, the
hang remains forever, even before commit cff549.

sysrq: SysRq :
sysrq: Show Blocked State
  task                        PC stack   pid father
kworker/2:0     D ffff88046fa95c00     0 21178      2 0x00000000
Workqueue: srp_remove srp_remove_work [ib_srp]
Call Trace:
 [<ffffffff815dd985>] schedule+0x35/0x80
 [<ffffffff815e02c7>] schedule_timeout+0x237/0x2d0
 [<ffffffff815dcf46>] io_schedule_timeout+0xa6/0x110
 [<ffffffff815de2f3>] wait_for_completion_io+0xa3/0x110
 [<ffffffff812e66ff>] blk_execute_rq+0xdf/0x120
 [<ffffffffa00135de>] scsi_execute+0xce/0x150 [scsi_mod]
 [<ffffffffa001548f>] scsi_execute_req_flags+0x8f/0xf0 [scsi_mod]
 [<ffffffffa0154849>] sd_sync_cache+0xa9/0x190 [sd_mod]
 [<ffffffffa0154c3a>] sd_shutdown+0x6a/0x100 [sd_mod]
 [<ffffffffa0154d34>] sd_remove+0x64/0xc0 [sd_mod]
 [<ffffffff8144d3fd>] __device_release_driver+0x8d/0x120
 [<ffffffff8144d4ae>] device_release_driver+0x1e/0x30
 [<ffffffff8144c239>] bus_remove_device+0xf9/0x170
 [<ffffffff81448bc7>] device_del+0x127/0x240
 [<ffffffffa001c0f1>] __scsi_remove_device+0xc1/0xd0 [scsi_mod]
 [<ffffffffa001a5d7>] scsi_forget_host+0x57/0x60 [scsi_mod]
 [<ffffffffa000e3d2>] scsi_remove_host+0x72/0x110 [scsi_mod]
 [<ffffffffa06f95ab>] srp_remove_work+0x8b/0x200 [ib_srp]
 ...

Signed-off-by: Israel Rukshin <israelr@xxxxxxxxxxxx>
Reviewed-by: Max Gurtovoy <maxg@xxxxxxxxxxxx>
---
Changes from v1:
- add extra description to the commit message and to the comment.
- refer to the commit that originally introduced this hang.
---
 drivers/scsi/scsi_sysfs.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 82dfe07..8a977f5 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -1282,6 +1282,8 @@ void __scsi_remove_device(struct scsi_device *sdev)
 		return;
 
 	if (sdev->is_visible) {
+		enum scsi_device_state oldstate = sdev->sdev_state;
+
 		if (scsi_device_set_state(sdev, SDEV_CANCEL) != 0)
 			return;
 
@@ -1289,6 +1291,17 @@
 		device_unregister(&sdev->sdev_dev);
 		transport_remove_device(dev);
 		scsi_dh_remove_device(sdev);
+
+		/*
+		 * Fail new requests if the old state was offline.
+		 * This avoids a hang in sd_shutdown().
+		 * The SYNCHRONIZE CACHE request timer will never start
+		 * in that case.
+		 */
+		if (oldstate == SDEV_TRANSPORT_OFFLINE ||
+		    oldstate == SDEV_OFFLINE)
+			blk_set_queue_dying(sdev->request_queue);
+
 		device_del(dev);
 	} else
 		put_device(&sdev->sdev_dev);
-- 
2.4.3