It is possible for two threads to be running blk_mq_do_dispatch_sched() at the same time with the same "hctx". This is because there can be more than one caller to __blk_mq_run_hw_queue() with the same "hctx" and hctx_lock() doesn't prevent more than one thread from entering. If more than one thread is running blk_mq_do_dispatch_sched() at the same time with the same "hctx", they may have contention acquiring budget. The blk_mq_get_dispatch_budget() can eventually translate into scsi_mq_get_budget(). If the device's "queue_depth" is 1 (not uncommon) then only one of the two threads will be the one to increment "device_busy" to 1 and get the budget. The losing thread will break out of blk_mq_do_dispatch_sched() and will stop dispatching requests. The assumption is that when more budget is available later (when existing transactions finish) the queue will be kicked again, perhaps in scsi_end_request(). The winning thread now has budget and can go on to call dispatch_request(). If dispatch_request() returns NULL here then we have a potential problem. Specifically we'll now call blk_mq_put_dispatch_budget() which translates into scsi_mq_put_budget(). That will mark the device as no longer busy but doesn't do anything to kick the queue. This violates the assumption that the queue would be kicked when more budget was available. Pictorially: Thread A Thread B ================================= ================================== blk_mq_get_dispatch_budget() => 1 dispatch_request() => NULL blk_mq_get_dispatch_budget() => 0 // because Thread A marked // "device_busy" in scsi_device blk_mq_put_dispatch_budget() The above case was observed in reboot tests and caused a task to hang forever waiting for IO to complete. Traces showed that in fact two tasks were running blk_mq_do_dispatch_sched() at the same time with the same "hctx". The task that got the budget did in fact see dispatch_request() return NULL. Both tasks returned and the system went on for several minutes (until the hung task delay kicked in) without the given "hctx" showing up again in traces. Let's attempt to fix this problem by detecting if there was contention for the budget in the case where we put the budget without dispatching anything. If we saw contention we kick all hctx's associated with the queue where there was contention. We do this without any locking by adding a double-check for budget and accepting a small amount of faux contention if the 2nd check gives us budget but then we don't dispatch anything (we'll look like we contended with ourselves). A few extra notes: - This whole thing is only a problem due to the inexact nature of has_work(). Specifically if has_work() always guaranteed that a "true" return meant that dispatch_request() would return non-NULL then we wouldn't have this problem. That's because we only grab the budget if has_work() returned true. If we had the non-NULL guarantee then at least one of the threads would actually dispatch work and when the work was done then queues would be kicked normally. - One specific I/O scheduler that trips this problem quite a bit is BFQ which definitely returns "true" for has_work() in cases where it wouldn't dispatch. Making BFQ's has_work() more exact requires that has_work() becomes a much heavier function, including figuring out how to acquire spinlocks in has_work() without tripping circular lock dependencies. This is prototyped but it's unclear if it's really the way to go when the problem can be solved with a relatively lightweight contention detection mechanism. - Because this problem only trips with inexact has_work() it's believed that blk_mq_do_dispatch_ctx() does not need a similar change. Signed-off-by: Douglas Anderson <dianders@xxxxxxxxxxxx> --- Changes in v2: - Replace ("scsi: core: Fix stall...") w/ ("blk-mq: Rerun dispatch...") block/blk-mq-sched.c | 26 ++++++++++++++++++++++++-- block/blk-mq.c | 3 +++ include/linux/blkdev.h | 2 ++ 3 files changed, 29 insertions(+), 2 deletions(-) diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index 74cedea56034..0195d75f5f96 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -97,12 +97,34 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) if (e->type->ops.has_work && !e->type->ops.has_work(hctx)) break; - if (!blk_mq_get_dispatch_budget(hctx)) - break; + + if (!blk_mq_get_dispatch_budget(hctx)) { + /* + * We didn't get budget so set contention. If + * someone else had the budget but didn't dispatch + * they'll kick everything. NOTE: we check one + * extra time _after_ setting contention to fully + * close the race. If we don't actually dispatch + * we'll detext faux contention (with ourselves) + * but that should be rare. + */ + atomic_set(&q->budget_contention, 1); + + if (!blk_mq_get_dispatch_budget(hctx)) + break; + } rq = e->type->ops.dispatch_request(hctx); if (!rq) { blk_mq_put_dispatch_budget(hctx); + + /* + * We've released the budget but us holding it might + * have prevented someone else from dispatching. + * Detect that case and run all queues again. + */ + if (atomic_read(&q->budget_contention)) + blk_mq_run_hw_queues(q, true); break; } diff --git a/block/blk-mq.c b/block/blk-mq.c index 2cd8d2b49ff4..6163c43ceca5 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1528,6 +1528,9 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async) struct blk_mq_hw_ctx *hctx; int i; + /* We're running the queues, so clear the contention detector */ + atomic_set(&q->budget_contention, 0); + queue_for_each_hw_ctx(q, hctx, i) { if (blk_mq_hctx_stopped(hctx)) continue; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index f629d40c645c..07f21e45d993 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -583,6 +583,8 @@ struct request_queue { #define BLK_MAX_WRITE_HINTS 5 u64 write_hints[BLK_MAX_WRITE_HINTS]; + + atomic_t budget_contention; }; #define QUEUE_FLAG_STOPPED 0 /* queue is stopped */ -- 2.26.0.rc2.310.g2932bb562d-goog