blk_On Fri, Oct 21, 2022 at 10:40 AM Keith Busch <kbusch@xxxxxxxxxx> wrote: > > The blk_mq_wait_quiesce_done() will only wait for tasks that entered > just before calling that function. It will not wait for tasks that > entered immediately after. Right. The design was depending on saving a jiffies value before the blk_mq_wait_quiesce_done() call and deadline behavior to avoid having to care about queue_rq calls starting after blk_mq_wait_quiesce_done(). > If I correctly understand the problem you're describing, the hypervisor > may prevent any guest process from running. If so, the timeout work may > be stalled after the quiesce, and if a queue_rq() process also stalled > after starting quiesce_done(), then we're in the same situation you're > trying to prevent, right? For handling this condition and avoiding the issue, the code was depending on the deadline value and rq->state value. blk_mq_start_request sets a deadline then sets MQ_RQ_IN_FLIGHT, while blk_mq_req_expired checks for MQ_RQ_IN_FLIGHT then checks the deadline. A new queue_rq call would then not count as expired from these checks from either its deadline being updated to be in the future or from missing MQ_RQ_IN_FLIGHT . However, there are no memory barriers ensuring this ordering. So it looks like it should prevent an issue with a new queue_rq on x86 cpus but could race on cpus with weaker memory ordering. > I agree with your idea that this is a lower level driver responsibility: > it should reclaim all started requests before allowing new queuing. > Perhaps the block layer should also raise a clear warning if it's > queueing a request that's already started. > If I understand what you are saying, that is roughly how the old scsi timeout handling used to work before the async-abort changes. It prevented new requests and drained everything (though HBA-wide not just request_queue-wide) before doing anything with timed out commands. The current implementation makes avoiding race conditions more difficult but reduces the performance impact of a lone misbehaving request. So a driver under the block layer could potentially avoid the current races by properly quiescing the request_queue before handling any timed out requests. But I was trying to come up with a solution which avoided that requirement. David Jeffery