Hi, Bart,

Thank you very much for the quick response.

But I'm not using mq, and I ran into these two problems on a non-mq system.
The patch you pointed out is a fix for mq, so I don't think it can resolve this problem.

IIUC, mq is for SSDs? I'm not using an SSD, so mq is disabled.

On Mon, 2017-08-14 at 11:23 +0000, Tangchen (UVP) wrote:
> Problem 2:
>
> ***************
> [What it looks like]
> ***************
> When a scsi device is removed while a network error happens, __blk_drain_queue() could hang forever.
>
> # cat /proc/19160/stack
> [<ffffffff8005886d>] msleep+0x1d/0x30
> [<ffffffff80201a84>] __blk_drain_queue+0xe4/0x160
> [<ffffffff80202766>] blk_cleanup_queue+0x106/0x2e0
> [<ffffffffa000fb02>] __scsi_remove_device+0x52/0xc0 [scsi_mod]
> [<ffffffffa000fb9b>] scsi_remove_device+0x2b/0x40 [scsi_mod]
> [<ffffffffa000fbc0>] sdev_store_delete_callback+0x10/0x20 [scsi_mod]
> [<ffffffff801a4e75>] sysfs_schedule_callback_work+0x15/0x80
> [<ffffffff80062d69>] process_one_work+0x169/0x340
> [<ffffffff800667e3>] worker_thread+0x183/0x490
> [<ffffffff8006a526>] kthread+0x96/0xa0
> [<ffffffff8041ebb4>] kernel_thread_helper+0x4/0x10
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> The request queue of this device was stopped. So the following check will be true forever:
> __blk_run_queue()
> {
>         if (unlikely(blk_queue_stopped(q)))
>                 return;
>
>         __blk_run_queue_uncond(q);
> }
>
> So __blk_run_queue_uncond() will never be called, and the process hangs.
>
> [ ... ]
>
> ****************
> [How to reproduce]
> ****************
> Unfortunately I cannot reproduce it in the latest kernel.
> The script below will help to reproduce, but not very often.
>
> # create network error
> tc qdisc add dev eth1 root netem loss 60%
>
> # restart iscsid and rescan scsi bus again and again
> while [ 1 ]
> do
>         systemctl restart iscsid
>         rescan-scsi-bus (http://manpages.ubuntu.com/manpages/trusty/man8/rescan-scsi-bus.8.html)
> done

This should have been fixed by commit 36e3cf273977 ("scsi: Avoid that SCSI
queues get stuck"). The first mainline kernel that includes this commit is
kernel v4.11.

> void __blk_run_queue(struct request_queue *q) {
> -       if (unlikely(blk_queue_stopped(q)))
> +       if (unlikely(blk_queue_stopped(q)) &&
> +           unlikely(!blk_queue_dying(q)))
>                 return;
>
>         __blk_run_queue_uncond(q);

Are you aware that the single queue block layer is on its way out and will be
removed sooner or later? Please focus your testing on scsi-mq.

Regarding the above patch: it is wrong because it will cause lockups during
path removal for other block drivers. Please drop this patch.

Bart.
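
The hang reported under "Problem 2" can be illustrated with a small user-space model. This is only a sketch, not kernel code: struct model_queue, model_run_queue(), model_drain_queue() and the max_iters bound are hypothetical names made up for illustration. The model assumes the drain loop keeps retrying until no requests are pending, and that a stopped queue returns early from the run function as in the quoted __blk_run_queue() check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for a request queue (not the kernel struct). */
struct model_queue {
	bool stopped;	/* models blk_queue_stopped(q) */
	int pending;	/* requests still waiting to be dispatched */
};

/* Models the quoted check: a stopped queue returns early and dispatches nothing. */
static void model_run_queue(struct model_queue *q)
{
	if (q->stopped)
		return;
	if (q->pending > 0)
		q->pending--;	/* dispatch one request */
}

/*
 * Models a drain loop that runs the queue until nothing is pending.
 * max_iters is an assumption added only so that this model terminates;
 * the reported stack trace shows the real loop sleeping and retrying.
 * Returns the number of requests left when the loop gives up.
 */
static int model_drain_queue(struct model_queue *q, int max_iters)
{
	assert(q != NULL);
	for (int i = 0; i < max_iters && q->pending > 0; i++)
		model_run_queue(q);
	return q->pending;
}
```

With a queue that is not stopped, model_drain_queue() returns 0 once all pending requests have been dispatched. With stopped set to true, the pending count never changes no matter how large max_iters is, which corresponds to the process sitting in msleep() inside __blk_drain_queue().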