On Thu, Nov 09, 2017 at 09:32:58AM -0700, Jens Axboe wrote:
> On 11/09/2017 08:30 AM, Jens Axboe wrote:
> > On 11/09/2017 03:00 AM, Ming Lei wrote:
> >> On Thu, Nov 09, 2017 at 11:41:40AM +0800, Ming Lei wrote:
> >>> On Wed, Nov 08, 2017 at 03:48:51PM -0700, Jens Axboe wrote:
> >>>> This patch attempts to make the case of hctx re-running on driver tag
> >>>> failure more robust. Without this patch, it's pretty easy to trigger a
> >>>> stall condition with shared tags. An example is using null_blk like
> >>>> this:
> >>>>
> >>>> modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 submit_queues=1 hw_queue_depth=1
> >>>>
> >>>> which sets up 4 devices, sharing the same tag set with a depth of 1.
> >>>> Running a fio job ala:
> >>>>
> >>>> [global]
> >>>> bs=4k
> >>>> rw=randread
> >>>> norandommap
> >>>> direct=1
> >>>> ioengine=libaio
> >>>> iodepth=4
> >>>>
> >>>> [nullb0]
> >>>> filename=/dev/nullb0
> >>>> [nullb1]
> >>>> filename=/dev/nullb1
> >>>> [nullb2]
> >>>> filename=/dev/nullb2
> >>>> [nullb3]
> >>>> filename=/dev/nullb3
> >>>>
> >>>> will inevitably end with one or more threads being stuck waiting for a
> >>>> scheduler tag. That IO is then stuck forever, until someone else
> >>>> triggers a run of the queue.
> >>>>
> >>>> Ensure that we always re-run the hardware queue, if the driver tag we
> >>>> were waiting for got freed before we added our leftover request entries
> >>>> back on the dispatch list.
> >>>>
> >>>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
> >>>>
> >>>> ---
> >>>>
> >>>> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> >>>> index 7f4a1ba532af..bb7f08415203 100644
> >>>> --- a/block/blk-mq-debugfs.c
> >>>> +++ b/block/blk-mq-debugfs.c
> >>>> @@ -179,7 +179,6 @@ static const char *const hctx_state_name[] = {
> >>>>  	HCTX_STATE_NAME(STOPPED),
> >>>>  	HCTX_STATE_NAME(TAG_ACTIVE),
> >>>>  	HCTX_STATE_NAME(SCHED_RESTART),
> >>>> -	HCTX_STATE_NAME(TAG_WAITING),
> >>>>  	HCTX_STATE_NAME(START_ON_RUN),
> >>>>  };
> >>>>  #undef HCTX_STATE_NAME
> >>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >>>> index 3d759bb8a5bb..8dc5db40df9d 100644
> >>>> --- a/block/blk-mq.c
> >>>> +++ b/block/blk-mq.c
> >>>> @@ -998,49 +998,64 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
> >>>>  	return rq->tag != -1;
> >>>>  }
> >>>>
> >>>> -static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode, int flags,
> >>>> -				void *key)
> >>>> +static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
> >>>> +				int flags, void *key)
> >>>>  {
> >>>>  	struct blk_mq_hw_ctx *hctx;
> >>>>
> >>>>  	hctx = container_of(wait, struct blk_mq_hw_ctx, dispatch_wait);
> >>>>
> >>>> -	list_del(&wait->entry);
> >>>> -	clear_bit_unlock(BLK_MQ_S_TAG_WAITING, &hctx->state);
> >>>> +	list_del_init(&wait->entry);
> >>>>  	blk_mq_run_hw_queue(hctx, true);
> >>>>  	return 1;
> >>>>  }
> >>>>
> >>>> -static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx)
> >>>> +static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx **hctx,
> >>>> +				     struct request *rq)
> >>>>  {
> >>>> +	struct blk_mq_hw_ctx *this_hctx = *hctx;
> >>>> +	wait_queue_entry_t *wait = &this_hctx->dispatch_wait;
> >>>>  	struct sbq_wait_state *ws;
> >>>>
> >>>> +	if (!list_empty_careful(&wait->entry))
> >>>> +		return false;
> >>>> +
> >>>> +	spin_lock(&this_hctx->lock);
> >>>> +	if (!list_empty(&wait->entry)) {
> >>>> +		spin_unlock(&this_hctx->lock);
> >>>> +		return false;
> >>>> +	}
> >>>> +
> >>>> +	ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
> >>>> +	add_wait_queue(&ws->wait, wait);
> >>>> +
> >>>>  	/*
> >>>> -	 * The TAG_WAITING bit serves as a lock protecting hctx->dispatch_wait.
> >>>> -	 * The thread which wins the race to grab this bit adds the hardware
> >>>> -	 * queue to the wait queue.
> >>>> +	 * It's possible that a tag was freed in the window between the
> >>>> +	 * allocation failure and adding the hardware queue to the wait
> >>>> +	 * queue.
> >>>>  	 */
> >>>> -	if (test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state) ||
> >>>> -	    test_and_set_bit_lock(BLK_MQ_S_TAG_WAITING, &hctx->state))
> >>>> +	if (!blk_mq_get_driver_tag(rq, hctx, false)) {
> >>>> +		spin_unlock(&this_hctx->lock);
> >>>>  		return false;
> >>>> -
> >>>> -	init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake);
> >>>> -	ws = bt_wait_ptr(&hctx->tags->bitmap_tags, hctx);
> >>>> +	}
> >>>>
> >>>>  	/*
> >>>> -	 * As soon as this returns, it's no longer safe to fiddle with
> >>>> -	 * hctx->dispatch_wait, since a completion can wake up the wait queue
> >>>> -	 * and unlock the bit.
> >>>> +	 * We got a tag, remove ourselves from the wait queue to ensure
> >>>> +	 * someone else gets the wakeup.
> >>>>  	 */
> >>>> -	add_wait_queue(&ws->wait, &hctx->dispatch_wait);
> >>>> +	spin_lock_irq(&ws->wait.lock);
> >>>> +	list_del_init(&wait->entry);
> >>>> +	spin_unlock_irq(&ws->wait.lock);
> >>>> +	spin_unlock(&this_hctx->lock);
> >>>>  	return true;
> >>>>  }
> >>>>
> >>>>  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
> >>>> -		bool got_budget)
> >>>> +			     bool got_budget)
> >>>>  {
> >>>>  	struct blk_mq_hw_ctx *hctx;
> >>>>  	struct request *rq, *nxt;
> >>>> +	bool no_tag = false;
> >>>>  	int errors, queued;
> >>>>
> >>>>  	if (list_empty(list))
> >>>> @@ -1060,22 +1075,15 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
> >>>>  		if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
> >>>>  			/*
> >>>>  			 * The initial allocation attempt failed, so we need to
> >>>> -			 * rerun the hardware queue when a tag is freed.
> >>>> +			 * rerun the hardware queue when a tag is freed. The
> >>>> +			 * waitqueue takes care of that. If the queue is run
> >>>> +			 * before we add this entry back on the dispatch list,
> >>>> +			 * we'll re-run it below.
> >>>>  			 */
> >>>> -			if (!blk_mq_dispatch_wait_add(hctx)) {
> >>>> -				if (got_budget)
> >>>> -					blk_mq_put_dispatch_budget(hctx);
> >>>> -				break;
> >>>> -			}
> >>>> -
> >>>> -			/*
> >>>> -			 * It's possible that a tag was freed in the window
> >>>> -			 * between the allocation failure and adding the
> >>>> -			 * hardware queue to the wait queue.
> >>>> -			 */
> >>>> -			if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
> >>>> +			if (!blk_mq_dispatch_wait_add(&hctx, rq)) {
> >>>>  				if (got_budget)
> >>>>  					blk_mq_put_dispatch_budget(hctx);
> >>>> +				no_tag = true;
> >>>>  				break;
> >>>>  			}
> >>>>  		}
> >>>> @@ -1140,10 +1148,10 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
> >>>>  	 * it is no longer set that means that it was cleared by another
> >>>>  	 * thread and hence that a queue rerun is needed.
> >>>>  	 *
> >>>> -	 * If TAG_WAITING is set that means that an I/O scheduler has
> >>>> -	 * been configured and another thread is waiting for a driver
> >>>> -	 * tag. To guarantee fairness, do not rerun this hardware queue
> >>>> -	 * but let the other thread grab the driver tag.
> >>>> +	 * If 'no_tag' is set, that means that we failed getting
> >>>> +	 * a driver tag with an I/O scheduler attached. If our dispatch
> >>>> +	 * waitqueue is no longer active, ensure that we run the queue
> >>>> +	 * AFTER adding our entries back to the list.
> >>>>  	 *
> >>>>  	 * If no I/O scheduler has been configured it is possible that
> >>>>  	 * the hardware queue got stopped and restarted before requests
> >>>> @@ -1156,7 +1164,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
> >>>>  	 * and dm-rq.
> >>>>  	 */
> >>>>  	if (!blk_mq_sched_needs_restart(hctx) &&
> >>>> -	    !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state))
> >>>> +	    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> >>>>  		blk_mq_run_hw_queue(hctx, true);
> >>>
> >>> If one rq is just completed after the check on list_empty_careful(&hctx->dispatch_wait.entry),
> >>> the queue may not be run any more. Could that be an issue?
> >>
> >> Looks like that can be an issue.
> >>
> >> If I revert "Revert "blk-mq: don't handle TAG_SHARED in restart"" and
> >> apply 'blk-mq: put driver tag if dispatch budget can't be got' against
> >> for-4.15/block, I can still easily trigger an IO hang within one or two minutes:
> >>
> >> 1) script
> >> #!/bin/sh
> >> modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 submit_queues=1 hw_queue_depth=1
> >> RUNTIME=10
> >>
> >> while true; do
> >> fio --bs=4k --size=512G --rw=randread --norandommap --direct=1 --ioengine=libaio --iodepth=4 --runtime=$RUNTIME --group_reporting=1 --name=nullb0 --filename=/dev/nullb0 --name=nullb1 --filename=/dev/nullb1 --name=nullb2 --filename=/dev/nullb2 --name=nullb3 --filename=/dev/nullb3
> >
> > Did you apply my patch too? I had my test case running overnight, and it
> > completed just fine. That's current for-4.15/block + the patch I posted.
> > Previously that would hang in minutes as well.
> >
> > I'm running your test case here now, but it looks identical to mine.
> > It's been running for 5 min without issue so far, I'll leave it running
> > for an hour or so.
>
> It's been running happily for > 1 hour now, no issues observed.

It looks like something was wrong in yesterday's test: today I can't
reproduce the issue running the latest for-4.15/block with a2820d1544c1
reverted, and the observed IOPS is more than 20% higher than yesterday's
result. The new dispatch wakeup now appears to work well, and it seems
fine to cover RESTART for TAG_SHARED completely. In the next dev cycle,
we may consider removing that workaround.

--
Ming
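
P.S. The ordering that the new blk_mq_dispatch_wait_add() depends on -
register on the waitqueue first, then retry the tag allocation - is easier
to see in a minimal userspace sketch. This is only an analogue with
made-up names, not the kernel code: a pthread mutex/condvar pair plus a
waiter count stand in for the sbitmap waitqueue, and a bool stands in for
the single shared driver tag.

/* tagwait.c - build with: cc -O2 -pthread -o tagwait tagwait.c */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wake = PTHREAD_COND_INITIALIZER;
static bool tag_free = true;	/* the single shared "driver tag" */
static int  waiters;		/* stand-in for hctx->dispatch_wait */

/* Fast path: like blk_mq_get_driver_tag(), may simply fail. */
static bool try_get_tag(void)
{
	bool ok;

	pthread_mutex_lock(&lock);
	ok = tag_free;
	if (ok)
		tag_free = false;
	pthread_mutex_unlock(&lock);
	return ok;
}

/* Like a request completion: free the tag, then wake any waiter. */
static void put_tag(void)
{
	pthread_mutex_lock(&lock);
	tag_free = true;
	if (waiters)		/* like blk_mq_dispatch_wake() */
		pthread_cond_broadcast(&wake);
	pthread_mutex_unlock(&lock);
}

/*
 * The blk_mq_dispatch_wait_add() ordering: register as a waiter first,
 * then re-check the tag before sleeping. A put_tag() racing with us
 * either leaves the tag for the re-check, or sees us registered and
 * wakes us, so the wakeup can no longer fall into the gap between
 * "alloc failed" and "waiter registered".
 */
static void get_tag_slow(void)
{
	pthread_mutex_lock(&lock);
	waiters++;			/* "add_wait_queue()" */
	while (!tag_free)		/* re-check; sleep only if needed */
		pthread_cond_wait(&wake, &lock);
	tag_free = false;
	waiters--;			/* "list_del_init(&wait->entry)" */
	pthread_mutex_unlock(&lock);
}

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < 100000; i++) {
		if (!try_get_tag())
			get_tag_slow();
		put_tag();		/* "complete" the request */
	}
	return NULL;
}

int main(void)
{
	pthread_t t[4];		/* four queues contending for one tag */

	for (int i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	puts("no stalls");
	return 0;
}

If a worker instead went to sleep without first bumping 'waiters', a
racing put_tag() could see waiters == 0 and skip the wakeup, leaving the
sleeper stuck even though a tag is free. That is the analogue of the
stall the patch closes by calling add_wait_queue() before retrying
blk_mq_get_driver_tag().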