On Wed, Nov 08, 2017 at 02:20:41PM +0800, Ming Lei wrote: > On Tue, Nov 07, 2017 at 08:17:59PM -0700, Jens Axboe wrote: > > On 11/07/2017 08:12 PM, Ming Lei wrote: > > > On Tue, Nov 07, 2017 at 08:01:48PM -0700, Jens Axboe wrote: > > >> On 11/07/2017 06:03 PM, Ming Lei wrote: > > >>> On Tue, Nov 07, 2017 at 03:06:31PM -0700, Jens Axboe wrote: > > >>>> On 11/07/2017 10:36 AM, Jens Axboe wrote: > > >>>>> On 11/07/2017 10:10 AM, Jens Axboe wrote: > > >>>>>> On 11/07/2017 09:29 AM, Jens Axboe wrote: > > >>>>>>> On 11/07/2017 09:20 AM, Bart Van Assche wrote: > > >>>>>>>> On Tue, 2017-11-07 at 10:11 +0800, Ming Lei wrote: > > >>>>>>>>> If you can reproduce, please provide me at least the following log > > >>>>>>>>> first: > > >>>>>>>>> > > >>>>>>>>> find /sys/kernel/debug/block -name tags | xargs cat | grep busy > > >>>>>>>>> > > >>>>>>>>> If any pending requests arn't completed, please provide the related > > >>>>>>>>> info in dbgfs about where is the request. > > >>>>>>>> > > >>>>>>>> Every time I ran the above or a similar command its output was empty. I > > >>>>>>>> assume that's because the hang usually occurs in a phase where these debugfs > > >>>>>>>> attributes either have not yet been created or have already disappeared. > > >>>>>>> > > >>>>>>> Bart, do you still see a hang with the patch that fixes the tag leak when > > >>>>>>> we fail to get a dispatch budget? > > >>>>>>> > > >>>>>>> https://marc.info/?l=linux-block&m=151004881411480&w=2 > > >>>>>>> > > >>>>>>> I've run a lot of stability testing here, and I haven't run into any > > >>>>>>> issues. This is with shared tags as well. So if you still see the failure > > >>>>>>> case with the current tree AND the above patch, then I'll try and get > > >>>>>>> a test case setup that hits it too so we can get to the bottom of this. > > >>>>>> > > >>>>>> Ming, I managed to reproduce the hang using null_blk. Note this is > > >>>>>> WITHOUT the patch mentioned above, running with that now. > > >>>>>> > > >>>>>> # modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 submit_queues=1 hw_queue_depth=1 > > >>>>>> > > >>>>>> and using this fio file: > > >>>>>> > > >>>>>> [global] > > >>>>>> bs=4k > > >>>>>> rw=randread > > >>>>>> norandommap > > >>>>>> direct=1 > > >>>>>> ioengine=libaio > > >>>>>> iodepth=4 > > >>>>>> > > >>>>>> [nullb0] > > >>>>>> filename=/dev/nullb0 > > >>>>>> [nullb1] > > >>>>>> filename=/dev/nullb1 > > >>>>>> [nullb2] > > >>>>>> filename=/dev/nullb2 > > >>>>>> [nullb3] > > >>>>>> filename=/dev/nullb3 > > >>>>>> > > >>>>>> it seemed to keep running, but it hung when exiting. The troublesome > > >>>>>> device was nullb1: > > >>>>>> > > >>>>>> [ 492.513374] INFO: task fio:3263 blocked for more than 120 seconds. > > >>>>>> [ 492.520782] Not tainted 4.14.0-rc7+ #499 > > >>>>>> [ 492.526247] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > >>>>>> [ 492.535904] fio D13208 3263 3211 0x00000000 > > >>>>>> [ 492.542535] Call Trace: > > >>>>>> [ 492.545764] __schedule+0x279/0x720 > > >>>>>> [ 492.550151] schedule+0x33/0x90 > > >>>>>> [ 492.554145] io_schedule+0x16/0x40 > > >>>>>> [ 492.558435] blk_mq_get_tag+0x148/0x250 > > >>>>>> [ 492.563211] ? finish_wait+0x90/0x90 > > >>>>>> [ 492.567693] blk_mq_get_request+0xf0/0x3e0 > > >>>>>> [ 492.572760] blk_mq_make_request+0xe2/0x690 > > >>>>>> [ 492.577913] generic_make_request+0xfc/0x2f0 > > >>>>>> [ 492.583177] submit_bio+0x64/0x120 > > >>>>>> [ 492.587475] ? set_page_dirty_lock+0x4b/0x60 > > >>>>>> [ 492.592736] ? submit_bio+0x64/0x120 > > >>>>>> [ 492.597309] ? bio_set_pages_dirty+0x55/0x60 > > >>>>>> [ 492.602570] blkdev_direct_IO+0x388/0x3c0 > > >>>>>> [ 492.607546] ? free_ioctx_users+0xe0/0xe0 > > >>>>>> [ 492.612511] ? blk_mq_dispatch_rq_list+0x238/0x3a0 > > >>>>>> [ 492.618353] ? _raw_spin_unlock+0xe/0x20 > > >>>>>> [ 492.623227] generic_file_read_iter+0xb3/0xa00 > > >>>>>> [ 492.628682] ? generic_file_read_iter+0xb3/0xa00 > > >>>>>> [ 492.634334] ? security_file_permission+0x9b/0xc0 > > >>>>>> [ 492.640114] blkdev_read_iter+0x35/0x40 > > >>>>>> [ 492.644877] aio_read+0xc5/0x120 > > >>>>>> [ 492.648973] ? aio_read_events+0x24c/0x340 > > >>>>>> [ 492.654124] ? __might_sleep+0x4a/0x80 > > >>>>>> [ 492.658800] do_io_submit+0x47c/0x5e0 > > >>>>>> [ 492.663373] ? do_io_submit+0x47c/0x5e0 > > >>>>>> [ 492.668234] SyS_io_submit+0x10/0x20 > > >>>>>> [ 492.672715] ? SyS_io_submit+0x10/0x20 > > >>>>>> [ 492.677394] entry_SYSCALL_64_fastpath+0x13/0x94 > > >>>>>> [ 492.683039] RIP: 0033:0x7f83d1871717 > > >>>>>> [ 492.687521] RSP: 002b:00007ffd38fe5a88 EFLAGS: 00000202 ORIG_RAX: 00000000000000d1 > > >>>>>> [ 492.696969] RAX: ffffffffffffffda RBX: 00007f83b6972b50 RCX: 00007f83d1871717 > > >>>>>> [ 492.705423] RDX: 0000000001f41418 RSI: 0000000000000001 RDI: 00007f83e4d36000 > > >>>>>> [ 492.713889] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000001f3f2e0 > > >>>>>> [ 492.722352] R10: 0000000000001000 R11: 0000000000000202 R12: 00007ffd38fe5be0 > > >>>>>> [ 492.730827] R13: 00007f83b6972b01 R14: 00007f83b69824b8 R15: 00007f83b6982368 > > >>>>>> > > >>>>>> and if we look at the debug entries, it's waiting on a scheduler tag: > > >>>>>> > > >>>>>> sched_tags:nr_tags=2 > > >>>>>> sched_tags:nr_reserved_tags=0 > > >>>>>> sched_tags:active_queues=0 > > >>>>>> sched_tags:bitmap_tags: > > >>>>>> sched_tags:depth=2 > > >>>>>> sched_tags:busy=2 > > >>>>>> sched_tags:bits_per_word=64 > > >>>>>> sched_tags:map_nr=1 > > >>>>>> sched_tags:alloc_hint={0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0} > > >>>>>> sched_tags:wake_batch=1 > > >>>>>> sched_tags:wake_index=4 > > >>>>>> sched_tags:ws={ > > >>>>>> sched_tags: {.wait_cnt=-102, .wait=inactive}, > > >>>>>> sched_tags: {.wait_cnt=-126, .wait=inactive}, > > >>>>>> sched_tags: {.wait_cnt=-72, .wait=inactive}, > > >>>>>> sched_tags: {.wait_cnt=-96, .wait=inactive}, > > >>>>>> sched_tags: {.wait_cnt=-134, .wait=inactive}, > > >>>>>> sched_tags: {.wait_cnt=-116, .wait=inactive}, > > >>>>>> sched_tags: {.wait_cnt=-90, .wait=inactive}, > > >>>>>> sched_tags: {.wait_cnt=-115, .wait=active}, > > >>>>>> sched_tags:} > > >>>>>> sched_tags:round_robin=0 > > >>>>>> sched_tags_bitmap:00000000: 03 > > >>>>>> > > >>>>>> with SCHED_RESTART being set: > > >>>>>> > > >>>>>> state:SCHED_RESTART > > >>>>>> > > >>>>>> and with the driver tags being idle: > > >>>>>> > > >>>>>> tags:nr_tags=1 > > >>>>>> tags:nr_reserved_tags=0 > > >>>>>> tags:active_queues=0 > > >>>>>> tags:bitmap_tags: > > >>>>>> tags:depth=1 > > >>>>>> tags:busy=0 > > >>>>>> tags:bits_per_word=64 > > >>>>>> tags:map_nr=1 > > >>>>>> tags:alloc_hint={0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} > > >>>>>> tags:wake_batch=1 > > >>>>>> tags:wake_index=3 > > >>>>>> tags:ws={ > > >>>>>> tags: {.wait_cnt=-48, .wait=inactive}, > > >>>>>> tags: {.wait_cnt=-39, .wait=inactive}, > > >>>>>> tags: {.wait_cnt=-50, .wait=inactive}, > > >>>>>> tags: {.wait_cnt=-47, .wait=inactive}, > > >>>>>> tags: {.wait_cnt=-25, .wait=inactive}, > > >>>>>> tags: {.wait_cnt=-24, .wait=inactive}, > > >>>>>> tags: {.wait_cnt=-47, .wait=inactive}, > > >>>>>> tags: {.wait_cnt=-47, .wait=inactive}, > > >>>>>> tags:} > > >>>>> > > >>>>> Unsurprisingly (since this is sched_tags starvation) this still happens > > >>>>> with the patch. Same trace as above. Note that dispatch has two requests > > >>>>> sitting ready: > > >>>>> > > >>>>> ffff881ff5ab1000 {.op=READ, .cmd_flags=, .rq_flags=STARTED|IO_STAT, .atomic_flags=COMPLETE, .tag=-1, .internal_tag=0} > > >>>>> ffff881ff5ab1240 {.op=READ, .cmd_flags=, .rq_flags=STARTED|IO_STAT, .atomic_flags=COMPLETE, .tag=-1, .internal_tag=1} > > >>>>> > > >>>>> while we're blocked trying to get a new sched tag off the regular > > >>>>> blk_mq_make_request() path. It looks like a missing restart of the hctx. > > >>>> > > >>>> Just to keep everyone in the loop, this bug is not new to > > >>>> for-4.15/block, nor is it new to the current 4.41-rc or 4.13. So it's > > >>>> probably different to what Bart is hitting, but it's a bug none the > > >>>> less... > > >>> > > >>> Thanks for running this test ans share us the result. > > >>> > > >>>> > > >>>> To me, this looks like a race between marking the queue as needing a > > >>>> restart, and the driver tag being released and re-running the queue. If > > >>>> I do: > > >>>> > > >>>> # echo run > /sys/kernel/debug/block/nullb1/state > > >>>> > > >>>> then it starts just fine. So there must be a race between marking the > > >>>> need for a restart (when getting the driver tag fails), and when one of > > >>>> the shared tag queues finishes a request and releases the driver tag. > > >>> > > >>> Just one hint, blk-mq's RESTART won't work if no requests are pending, > > >>> but looks no efficient/reliable way to handle that. > > >> > > >> Requests are pending, see above output. My assumption/theory was that > > >> we were racing on setting restart. Simplified version: > > >> > > >> CPUX CPUY > > >> get_driver_tag, fails > > >> complete request, put tag > > >> check restart, not set > > >> mark needing restart > > > > > > This case should be covered by TAG_WAITING. > > > > Right. The keyword here being 'should'. > > > > > If get_driver_tag fails on CPUX, this hctx will be added to tags's wait > > > queue, then the 'complete request' on CPUY will trigger to run queue via > > > blk_mq_dispatch_wake(). > > > > I know how it's _supposed_ to work. I'm saying that the debug info > > indicates that we have a problem in there, which matches what the > > TAG_WAITING should be covering for us. The TAG_WAITING may be fine and > > the issue could be somewhere else, it just looks identical for now. > > One possible reason: > > - the request to be dispatched may not be added to hctx->dispatch > when blk_mq_dispatch_wake() is triggered. If we call blk_mq_dispatch_wait_add() after request is added to hctx->dispatch, TAG_WAITING becomes basically similar with SCHED_RESTART. Now I don't object to revert 358a3a6bccb7(blk-mq: don't handle TAG_SHARED in restart) any more, because SCHED_RESTART works fine from this viewpoint even though it isn't efficient. We can improve it in the future. Also we might need to unify both TAG_WAITING and SCHED_RESTART into one generic restart mechanism, and make it efficient. IMO, both two aren't efficient enough. -- Ming