On Nov 02, 2021 / 11:44, Ming Lei wrote:
> On Tue, Nov 02, 2021 at 02:22:15AM +0000, Shinichiro Kawasaki wrote:
> > On Nov 01, 2021 / 17:01, Jens Axboe wrote:
> > > On 11/1/21 6:41 AM, Jens Axboe wrote:
> > > > On 11/1/21 2:34 AM, Shinichiro Kawasaki wrote:
> > > >> I tried the latest linux-block/for-next branch tip (git hash b43fadb6631f) and
> > > >> observed a process hang during a blktests block/005 run on an NVMe device.
> > > >> The kernel message reported "INFO: task check:1224 blocked for more than 122
> > > >> seconds." with the call trace [1]. So far, the hang is 100% reproducible on my
> > > >> system. It is not observed with HDDs or null_blk devices.
> > > >>
> > > >> I bisected and found that commit 4f5022453acd ("nvme: wire up completion
> > > >> batching for the IRQ path") triggers the hang. When I revert this commit from
> > > >> the for-next branch tip, the hang disappears. The block/005 test case switches
> > > >> the IO scheduler during IO, and the completion path change in that commit
> > > >> appears to affect the scheduler switch. Comments toward a solution would be
> > > >> appreciated.
> > > >
> > > > I'll take a look at this.
> > >
> > > I've tried running various things most of the day, and I cannot
> > > reproduce this issue, nor do I see what it could be. Even if requests are
> > > split between batched completion and one-by-one completion, it works
> > > just fine for me. No special care needs to be taken for put_many() on
> > > the queue reference, as the wake_up() happens when the ref goes to zero.
> > >
> > > Tell me more about your setup. What do the runtimes of the test look
> > > like? Do you have all schedulers enabled? What kind of NVMe device is
> > > this?
> >
> > Thank you for spending your precious time. With a kernel that does not hang,
> > the test case completes in around 20 seconds.
> > When the hang happens, the check script process stops at
> > blk_mq_freeze_queue_wait() on the scheduler change, and the fio workload
> > processes stop at __blkdev_direct_IO_simple(). The test case does not end,
> > so I need to reboot the system before the next trial. While waiting for the
> > test case to complete, the kernel repeats the same INFO message every 2
> > minutes.
> >
> > Regarding the scheduler, I compiled the kernel with mq-deadline and kyber.
> >
> > The NVMe device I use is a U.2 NVMe ZNS SSD. It has a zoned namespace and
> > a regular namespace, and the hang is observed with both namespaces. I have
> > not yet tried other NVMe devices, so I will try them.
> >
> > >
> > > FWIW, this is upstream now, so testing with Linus' -git would be
> > > preferable.
> >
> > I see. I have switched from the linux-block for-next branch to Linus'
> > upstream tree. At git hash 879dbe9ffebc, the hang is still observed.
>
> Can you post the blk-mq debugfs log after the hang is triggered?
>
> (cd /sys/kernel/debug/block/nvme0n1 && find . -type f -exec grep -aH . {} \;)

Thanks Ming. When I ran the command above, grep stopped when it opened the
tag related files in the debugfs tree; that grep command appeared to be
hanging as well. So I used the find command below instead, to exclude the
tag related files.

# find . -type f -not -name *tag* -exec grep -aH . {} \;

Here is the captured log.
./sched/queued:0 8 0
./sched/owned_by_driver:0 8 0
./sched/async_depth:192
./sched/starved:0
./sched/batching:1
./rqos/wbt/wb_background:4
./rqos/wbt/wb_normal:8
./rqos/wbt/unknown_cnt:0
./rqos/wbt/min_lat_nsec:2000000
./rqos/wbt/inflight:0: inflight 0
./rqos/wbt/inflight:1: inflight 0
./rqos/wbt/inflight:2: inflight 0
./rqos/wbt/id:0
./rqos/wbt/enabled:1
./rqos/wbt/curr_win_nsec:0
./hctx7/type:default
./hctx7/dispatch_busy:7
./hctx7/active:1024
./hctx7/run:5
./hctx7/ctx_map:00000000: 00
./hctx7/dispatch:000000003dfed3fd {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=192}
./hctx7/dispatch:0000000077876d9e {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=193}
./hctx7/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
./hctx7/state:TAG_ACTIVE|SCHED_RESTART
./hctx6/type:default
./hctx6/dispatch_busy:5
./hctx6/active:1025
./hctx6/run:4
./hctx6/ctx_map:00000000: 00
./hctx6/dispatch:00000000c0b8e1c9 {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=192}
./hctx6/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
./hctx6/state:TAG_ACTIVE|SCHED_RESTART
./hctx5/type:default
./hctx5/dispatch_busy:5
./hctx5/active:1024
./hctx5/run:4
./hctx5/ctx_map:00000000: 00
./hctx5/dispatch:00000000aaf1e364 {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=128}
./hctx5/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
./hctx5/state:TAG_ACTIVE|SCHED_RESTART
./hctx4/type:default
./hctx4/dispatch_busy:0
./hctx4/active:1023
./hctx4/run:1
./hctx4/ctx_map:00000000: 00
./hctx4/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
./hctx3/type:default
./hctx3/dispatch_busy:5
./hctx3/active:1024
./hctx3/run:4
./hctx3/ctx_map:00000000: 00
./hctx3/dispatch:000000008b07d5e1 {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=128}
./hctx3/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
./hctx3/state:TAG_ACTIVE|SCHED_RESTART
./hctx2/type:default
./hctx2/dispatch_busy:5
./hctx2/active:1024
./hctx2/run:4
./hctx2/ctx_map:00000000: 00
./hctx2/dispatch:00000000c4887013 {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=128}
./hctx2/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
./hctx2/state:TAG_ACTIVE|SCHED_RESTART
./hctx1/type:default
./hctx1/dispatch_busy:6
./hctx1/active:1024
./hctx1/run:5
./hctx1/ctx_map:00000000: 00
./hctx1/dispatch:00000000efe38e4e {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=0}
./hctx1/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
./hctx1/state:TAG_ACTIVE|SCHED_RESTART
./hctx0/type:default
./hctx0/dispatch_busy:5
./hctx0/active:1024
./hctx0/run:4
./hctx0/ctx_map:00000000: 00
./hctx0/dispatch:0000000015147095 {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=192}
./hctx0/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
./hctx0/state:TAG_ACTIVE|SCHED_RESTART
./write_hints:hint0: 0
./write_hints:hint1: 0
./write_hints:hint2: 0
./write_hints:hint3: 0
./write_hints:hint4: 0
./state:SAME_COMP|NONROT|IO_STAT|DISCARD|INIT_DONE|STATS|REGISTERED|PCI_P2PDMA|NOWAIT
./pm_only:0
./poll_stat:read (512 Bytes): samples=0
./poll_stat:write (512 Bytes): samples=0
./poll_stat:read (1024 Bytes): samples=0
./poll_stat:write (1024 Bytes): samples=0
./poll_stat:read (2048 Bytes): samples=0
./poll_stat:write (2048 Bytes): samples=0
./poll_stat:read (4096 Bytes): samples=0
./poll_stat:write (4096 Bytes): samples=0
./poll_stat:read (8192 Bytes): samples=0
./poll_stat:write (8192 Bytes): samples=0
./poll_stat:read (16384 Bytes): samples=0
./poll_stat:write (16384 Bytes): samples=0
./poll_stat:read (32768 Bytes): samples=0
./poll_stat:write (32768 Bytes): samples=0
./poll_stat:read (65536 Bytes): samples=0
./poll_stat:write (65536 Bytes): samples=0

--
Best Regards,
Shin'ichiro Kawasaki