On Wed, 2018-06-13 at 16:04 +0200, hch@xxxxxx wrote:
> > I suspect this is due to we could expire a same request twice or even more.
> > For scsi mid-layer, it return BLK_EH_DONE from .timeout, in fact, the request is not
> > completed there, but just queue a delayed abort_work (HZ/100). If the blk_mq_timeout_work
> > runs again before the abort_work, the request will be timed out again, because there is not
> > any mark on it to identify this request has been timed out.
> >
> > Would please try the patch attached on to see whether this issue could be fixed ?
> > (this patch only works for scsi device currently)
>
> The patch isn't really going to work without a caller of your new
> __blk_mq_complete_request helper, is it?

__blk_mq_complete_request() is already called today by
blk_mq_complete_request(). However, it's not clear to me why that function
is exported by Jianchao's patch.

> Either way the concept of doing error handling without quiescing the
> queue just looks bogus to me and will end up with some sort of race
> here or there.

The SCSI error handler already waits until all pending requests have
finished before it starts handling timed out commands. This e-mail thread
started with a report of a crash in the SCSI error handler, which is a
regression introduced in the v4.18 merge window.

Bart.