On Thu, Oct 20, 2022 at 05:10:13PM +0800, Ming Lei wrote:
> @@ -1593,10 +1598,17 @@ static void blk_mq_timeout_work(struct work_struct *work)
>  	if (!percpu_ref_tryget(&q->q_usage_counter))
>  		return;
>
> -	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
> +	/* Before walking tags, we must ensure any submit started before the
> +	 * current time has finished. Since the submit uses srcu or rcu, wait
> +	 * for a synchronization point to ensure all running submits have
> +	 * finished
> +	 */
> +	blk_mq_wait_quiesce_done(q);
> +
> +	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &expired);

The blk_mq_wait_quiesce_done() will only wait for tasks that entered
just before calling that function. It will not wait for tasks that
entered immediately after.

If I correctly understand the problem you're describing, the hypervisor
may prevent any guest process from running. If so, the timeout work may
be stalled after the quiesce, and if a queue_rq() process also stalled
after starting quiesce_done(), then we're in the same situation you're
trying to prevent, right?

I agree with your idea that this is a lower level driver responsibility:
it should reclaim all started requests before allowing new queuing.
Perhaps the block layer should also raise a clear warning if it's
queueing a request that's already started.
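
To make that last suggestion concrete, here is a rough userspace sketch of
the kind of check I have in mind. It is only a toy model, not block layer
code: struct toy_request, the toy_rq_state values, toy_queue_request() and
toy_reclaim_request() are all hypothetical names made up for illustration.
The queueing path refuses, with a warning, any request that is still marked
as started, and only an explicit reclaim makes it queueable again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a request's lifecycle state. */
enum toy_rq_state { TOY_RQ_IDLE, TOY_RQ_IN_FLIGHT };

/* Hypothetical stand-in for a request; not the kernel's struct request. */
struct toy_request {
	int tag;
	enum toy_rq_state state;
};

/*
 * Hand a request to the "driver". Returns false and prints a warning if
 * the request was already started and has not been reclaimed since.
 */
static bool toy_queue_request(struct toy_request *rq)
{
	if (rq->state != TOY_RQ_IDLE) {
		fprintf(stderr, "WARNING: tag %d queued while already started\n",
			rq->tag);
		return false;
	}
	rq->state = TOY_RQ_IN_FLIGHT;
	return true;
}

/*
 * Driver-side reclaim: return a started request to idle so that it may be
 * queued again. Reclaiming a request that was never started is a bug in
 * this model, so it trips the assertion.
 */
static void toy_reclaim_request(struct toy_request *rq)
{
	assert(rq->state == TOY_RQ_IN_FLIGHT);
	rq->state = TOY_RQ_IDLE;
}
```

In this model, queueing the same request twice without a reclaim in between
fails the second time and logs the warning, which is the situation the
driver is supposed to rule out before it allows new queuing.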