Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

Jens Axboe <axboe@xxxxxxxxx> · Tue, 7 Nov 2017 20:17:59 -0700

On 11/07/2017 08:12 PM, Ming Lei wrote:
> On Tue, Nov 07, 2017 at 08:01:48PM -0700, Jens Axboe wrote:
>> On 11/07/2017 06:03 PM, Ming Lei wrote:
>>> On Tue, Nov 07, 2017 at 03:06:31PM -0700, Jens Axboe wrote:
>>>> On 11/07/2017 10:36 AM, Jens Axboe wrote:
>>>>> On 11/07/2017 10:10 AM, Jens Axboe wrote:
>>>>>> On 11/07/2017 09:29 AM, Jens Axboe wrote:
>>>>>>> On 11/07/2017 09:20 AM, Bart Van Assche wrote:
>>>>>>>> On Tue, 2017-11-07 at 10:11 +0800, Ming Lei wrote:
>>>>>>>>> If you can reproduce, please provide me at least the following log
>>>>>>>>> first:
>>>>>>>>>
>>>>>>>>> 	find /sys/kernel/debug/block -name tags | xargs cat | grep busy
>>>>>>>>>
>>>>>>>>> If any pending requests arn't completed, please provide the related
>>>>>>>>> info in dbgfs about where is the request.
>>>>>>>>
>>>>>>>> Every time I ran the above or a similar command its output was empty. I
>>>>>>>> assume that's because the hang usually occurs in a phase where these debugfs
>>>>>>>> attributes either have not yet been created or have already disappeared.
>>>>>>>
>>>>>>> Bart, do you still see a hang with the patch that fixes the tag leak when
>>>>>>> we fail to get a dispatch budget?
>>>>>>>
>>>>>>> https://marc.info/?l=linux-block&m=151004881411480&w=2
>>>>>>>
>>>>>>> I've run a lot of stability testing here, and I haven't run into any
>>>>>>> issues. This is with shared tags as well. So if you still see the failure
>>>>>>> case with the current tree AND the above patch, then I'll try and get
>>>>>>> a test case setup that hits it too so we can get to the bottom of this.
>>>>>>
>>>>>> Ming, I managed to reproduce the hang using null_blk. Note this is
>>>>>> WITHOUT the patch mentioned above, running with that now.
>>>>>>
>>>>>> # modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 submit_queues=1 hw_queue_depth=1
>>>>>>
>>>>>> and using this fio file:
>>>>>>
>>>>>> [global]
>>>>>> bs=4k
>>>>>> rw=randread
>>>>>> norandommap
>>>>>> direct=1
>>>>>> ioengine=libaio
>>>>>> iodepth=4
>>>>>>
>>>>>> [nullb0]
>>>>>> filename=/dev/nullb0
>>>>>> [nullb1]
>>>>>> filename=/dev/nullb1
>>>>>> [nullb2]
>>>>>> filename=/dev/nullb2
>>>>>> [nullb3]
>>>>>> filename=/dev/nullb3
>>>>>>
>>>>>> it seemed to keep running, but it hung when exiting. The troublesome
>>>>>> device was nullb1:
>>>>>>
>>>>>> [  492.513374] INFO: task fio:3263 blocked for more than 120 seconds.
>>>>>> [  492.520782]       Not tainted 4.14.0-rc7+ #499
>>>>>> [  492.526247] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>> [  492.535904] fio             D13208  3263   3211 0x00000000
>>>>>> [  492.542535] Call Trace:
>>>>>> [  492.545764]  __schedule+0x279/0x720
>>>>>> [  492.550151]  schedule+0x33/0x90
>>>>>> [  492.554145]  io_schedule+0x16/0x40
>>>>>> [  492.558435]  blk_mq_get_tag+0x148/0x250
>>>>>> [  492.563211]  ? finish_wait+0x90/0x90
>>>>>> [  492.567693]  blk_mq_get_request+0xf0/0x3e0
>>>>>> [  492.572760]  blk_mq_make_request+0xe2/0x690
>>>>>> [  492.577913]  generic_make_request+0xfc/0x2f0
>>>>>> [  492.583177]  submit_bio+0x64/0x120
>>>>>> [  492.587475]  ? set_page_dirty_lock+0x4b/0x60
>>>>>> [  492.592736]  ? submit_bio+0x64/0x120
>>>>>> [  492.597309]  ? bio_set_pages_dirty+0x55/0x60
>>>>>> [  492.602570]  blkdev_direct_IO+0x388/0x3c0
>>>>>> [  492.607546]  ? free_ioctx_users+0xe0/0xe0
>>>>>> [  492.612511]  ? blk_mq_dispatch_rq_list+0x238/0x3a0
>>>>>> [  492.618353]  ? _raw_spin_unlock+0xe/0x20
>>>>>> [  492.623227]  generic_file_read_iter+0xb3/0xa00
>>>>>> [  492.628682]  ? generic_file_read_iter+0xb3/0xa00
>>>>>> [  492.634334]  ? security_file_permission+0x9b/0xc0
>>>>>> [  492.640114]  blkdev_read_iter+0x35/0x40
>>>>>> [  492.644877]  aio_read+0xc5/0x120
>>>>>> [  492.648973]  ? aio_read_events+0x24c/0x340
>>>>>> [  492.654124]  ? __might_sleep+0x4a/0x80
>>>>>> [  492.658800]  do_io_submit+0x47c/0x5e0
>>>>>> [  492.663373]  ? do_io_submit+0x47c/0x5e0
>>>>>> [  492.668234]  SyS_io_submit+0x10/0x20
>>>>>> [  492.672715]  ? SyS_io_submit+0x10/0x20
>>>>>> [  492.677394]  entry_SYSCALL_64_fastpath+0x13/0x94
>>>>>> [  492.683039] RIP: 0033:0x7f83d1871717
>>>>>> [  492.687521] RSP: 002b:00007ffd38fe5a88 EFLAGS: 00000202 ORIG_RAX: 00000000000000d1
>>>>>> [  492.696969] RAX: ffffffffffffffda RBX: 00007f83b6972b50 RCX: 00007f83d1871717
>>>>>> [  492.705423] RDX: 0000000001f41418 RSI: 0000000000000001 RDI: 00007f83e4d36000
>>>>>> [  492.713889] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000001f3f2e0
>>>>>> [  492.722352] R10: 0000000000001000 R11: 0000000000000202 R12: 00007ffd38fe5be0
>>>>>> [  492.730827] R13: 00007f83b6972b01 R14: 00007f83b69824b8 R15: 00007f83b6982368
>>>>>>
>>>>>> and if we look at the debug entries, it's waiting on a scheduler tag:
>>>>>>
>>>>>> sched_tags:nr_tags=2
>>>>>> sched_tags:nr_reserved_tags=0
>>>>>> sched_tags:active_queues=0
>>>>>> sched_tags:bitmap_tags:
>>>>>> sched_tags:depth=2
>>>>>> sched_tags:busy=2
>>>>>> sched_tags:bits_per_word=64
>>>>>> sched_tags:map_nr=1
>>>>>> sched_tags:alloc_hint={0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0}
>>>>>> sched_tags:wake_batch=1
>>>>>> sched_tags:wake_index=4
>>>>>> sched_tags:ws={
>>>>>> sched_tags:	{.wait_cnt=-102, .wait=inactive},
>>>>>> sched_tags:	{.wait_cnt=-126, .wait=inactive},
>>>>>> sched_tags:	{.wait_cnt=-72, .wait=inactive},
>>>>>> sched_tags:	{.wait_cnt=-96, .wait=inactive},
>>>>>> sched_tags:	{.wait_cnt=-134, .wait=inactive},
>>>>>> sched_tags:	{.wait_cnt=-116, .wait=inactive},
>>>>>> sched_tags:	{.wait_cnt=-90, .wait=inactive},
>>>>>> sched_tags:	{.wait_cnt=-115, .wait=active},
>>>>>> sched_tags:}
>>>>>> sched_tags:round_robin=0
>>>>>> sched_tags_bitmap:00000000: 03
>>>>>>
>>>>>> with SCHED_RESTART being set:
>>>>>>
>>>>>> state:SCHED_RESTART
>>>>>>
>>>>>> and with the driver tags being idle:
>>>>>>
>>>>>> tags:nr_tags=1
>>>>>> tags:nr_reserved_tags=0
>>>>>> tags:active_queues=0
>>>>>> tags:bitmap_tags:
>>>>>> tags:depth=1
>>>>>> tags:busy=0
>>>>>> tags:bits_per_word=64
>>>>>> tags:map_nr=1
>>>>>> tags:alloc_hint={0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
>>>>>> tags:wake_batch=1
>>>>>> tags:wake_index=3
>>>>>> tags:ws={
>>>>>> tags:	{.wait_cnt=-48, .wait=inactive},
>>>>>> tags:	{.wait_cnt=-39, .wait=inactive},
>>>>>> tags:	{.wait_cnt=-50, .wait=inactive},
>>>>>> tags:	{.wait_cnt=-47, .wait=inactive},
>>>>>> tags:	{.wait_cnt=-25, .wait=inactive},
>>>>>> tags:	{.wait_cnt=-24, .wait=inactive},
>>>>>> tags:	{.wait_cnt=-47, .wait=inactive},
>>>>>> tags:	{.wait_cnt=-47, .wait=inactive},
>>>>>> tags:}
>>>>>
>>>>> Unsurprisingly (since this is sched_tags starvation) this still happens
>>>>> with the patch. Same trace as above. Note that dispatch has two requests
>>>>> sitting ready:
>>>>>
>>>>> ffff881ff5ab1000 {.op=READ, .cmd_flags=, .rq_flags=STARTED|IO_STAT, .atomic_flags=COMPLETE, .tag=-1, .internal_tag=0}
>>>>> ffff881ff5ab1240 {.op=READ, .cmd_flags=, .rq_flags=STARTED|IO_STAT, .atomic_flags=COMPLETE, .tag=-1, .internal_tag=1}
>>>>>
>>>>> while we're blocked trying to get a new sched tag off the regular
>>>>> blk_mq_make_request() path. It looks like a missing restart of the hctx.
>>>>
>>>> Just to keep everyone in the loop, this bug is not new to
>>>> for-4.15/block, nor is it new to the current 4.41-rc or 4.13. So it's
>>>> probably different to what Bart is hitting, but it's a bug none the
>>>> less...
>>>
>>> Thanks for running this test ans share us the result.
>>>
>>>>
>>>> To me, this looks like a race between marking the queue as needing a
>>>> restart, and the driver tag being released and re-running the queue. If
>>>> I do:
>>>>
>>>> # echo run > /sys/kernel/debug/block/nullb1/state
>>>>
>>>> then it starts just fine. So there must be a race between marking the
>>>> need for a restart (when getting the driver tag fails), and when one of
>>>> the shared tag queues finishes a request and releases the driver tag.
>>>
>>> Just one hint, blk-mq's RESTART won't work if no requests are pending,
>>> but looks no efficient/reliable way to handle that.
>>
>> Requests are pending, see above output. My assumption/theory was that
>> we were racing on setting restart. Simplified version:
>>
>> CPUX					CPUY
>> get_driver_tag, fails
>> 					complete request, put tag
>> 					check restart, not set
>> mark needing restart
> 
> This case should be covered by TAG_WAITING.

Right. The keyword here being 'should'.

> If get_driver_tag fails on CPUX, this hctx will be added to tags's wait
> queue, then the 'complete request' on CPUY will trigger to run queue via
> blk_mq_dispatch_wake().

I know how it's _supposed_ to work. I'm saying that the debug info
indicates that we have a problem in there, which matches what the
TAG_WAITING should be covering for us. The TAG_WAITING may be fine and
the issue could be somewhere else, it just looks identical for now.

> Looks get_driver_tag() is run again after the hctx is added to tags's
> wait queue, so TAG_WAITING should work fine for this case in theory.

Yep, in theory it should be fine. I'm going to dive in tomorrow and
figure out where it falls apart.

CC'ing Omar - one thing we talked about today was the issue being
somewhat similar to what kyber recently fixed in this commit:

commit 8cf466602028196b939255f1eb4e9817efd1db6d
Author: Omar Sandoval <osandov@xxxxxx>
Date:   Wed Oct 11 10:39:15 2017 -0700

    kyber: fix hang on domain token wait queue

-- 
Jens Axboe