[bug report] Hang on sync after dd

John Garry <john.garry@xxxxxxxxxx> · Mon, 30 Nov 2020 11:22:33 +0000

Hi all,

Some guys internally upgraded to v5.10-rcX and start to see a hang after 
dd + sync for a large file:
- mount /dev/sda1 (ext4 filesystem) to directory /mnt;
- run "if=/dev/zero of=test1 bs=1M count=2000" on directory /mnt;
- run "sync"

and get:

[  367.912761] INFO: task jbd2/sdb1-8:3602 blocked for more than 120
seconds.
[  367.919618]       Not tainted 5.10.0-rc1-109488-g32ded76956b6 #948
[  367.925776] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  367.933579] task:jbd2/sdb1-8     state:D stack:    0 pid: 3602
ppid:     2 flags:0x00000028
[  367.941901] Call trace:
[  367.944351] __switch_to+0xb8/0x168
[  367.947840] __schedule+0x30c/0x670
[  367.951326] schedule+0x70/0x108
[  367.954550] io_schedule+0x1c/0xe8
[  367.957948] bit_wait_io+0x18/0x68
[  367.961346] __wait_on_bit+0x78/0xf0
[  367.964919] out_of_line_wait_on_bit+0x8c/0xb0
[  367.969356] __wait_on_buffer+0x30/0x40
[  367.973188] jbd2_journal_commit_transaction+0x1370/0x1958
[  367.978661] kjournald2+0xcc/0x260
[  367.982061] kthread+0x150/0x158
[  367.985288] ret_from_fork+0x10/0x34
[  367.988860] INFO: task sync:3823 blocked for more than 120 seconds.
[  367.995102]       Not tainted 5.10.0-rc1-109488-g32ded76956b6 #948
[  368.001265] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  368.009067] task:sync            state:D stack:    0 pid: 3823 ppid:
3450 flags:0x00000009
[  368.017397] Call trace:
[  368.019841] __switch_to+0xb8/0x168
[  368.023320] __schedule+0x30c/0x670
[  368.026804] schedule+0x70/0x108
[  368.030025] jbd2_log_wait_commit+0xbc/0x158
[  368.034290] ext4_sync_fs+0x188/0x1c8
[  368.037947] sync_fs_one_sb+0x30/0x40
[  368.041606] iterate_supers+0x9c/0x138
[  368.045350] ksys_sync+0x64/0xc0
[  368.048569] __arm64_sys_sync+0x10/0x20
[  368.052398] el0_svc_common.constprop.3+0x68/0x170
[  368.057177] do_el0_svc+0x24/0x90
[  368.060482] el0_sync_handler+0x118/0x168
[  368.064478]  el0_sync+0x158/0x180

The issue was reported here originally:
https://lore.kernel.org/linux-ext4/4d18326e-9ca2-d0cb-7cb8-cb56981280da@xxxxxxxxxxxxx/

But it looks like issue related to recent work for SCSI MQ.

They can only create with hisi_sas v3 hw. I could not create with 
megaraid sas on the same dev platform or hisi_sas on a similar dev board.

Reverting "scsi: core: Only re-run queue in scsi_end_request() if device 
queue is busy" seems solve the issue. Also, checking out to patch prior 
to "scsi: hisi_sas: Switch v3 hw to MQ" seems to not have the issue.

It occurs on SATA disks only, rather than SAS disks also, it seems. 
Maybe related to IO depth being smaller on SATA disk (32 vs 64), which 
can cause budget fail.

"mq-deadline" and "none" schedulers look to have the issue.

@Kashyap, have you guys tested megaraid sas much for this?

Thanks,
John

Block debugfs info is as follows:

estuary:/sys/kernel/debug/block/sda/hctx8$ cat
active cpu101/ cpu96/ cpu99/ dispatch_busy io_poll sched_tags tags
busy cpu102/ cpu97/ ctx_map dispatched queued sched_tags_bitmap tags_bitmap
cpu100/ cpu103/ cpu98/ dispatch flags run state type
estuary:/sys/kernel/debug/block/sda/hctx8$ cat cpu
cpu100/ cpu101/ cpu102/ cpu103/ cpu96/ cpu97/ cpu98/ cpu99/
estuary:/sys/kernel/debug/block/sda/hctx8$ cat cpu
cpu100/ cpu101/ cpu102/ cpu103/ cpu96/ cpu97/ cpu98/ cpu99/
estuary:/sys/kernel/debug/block/sda/hctx8$ cat cpu96/
completed default_rq_list dispatched merged poll_rq_list read_rq_list
estuary:/sys/kernel/debug/block/sda/hctx8$ cat cpu96/dispatched
0 0
estuary:/sys/kernel/debug/block/sda/hctx8$ cat cpu97/dispatched
0 0
estuary:/sys/kernel/debug/block/sda/hctx8$ cat cpu98/dispatched
0 0
estuary:/sys/kernel/debug/block/sda/hctx8$ cat cpu99/dispatched
0 0
estuary:/sys/kernel/debug/block/sda/hctx8$ cat cpu100/dispatched
3 0
estuary:/sys/kernel/debug/block/sda/hctx8$ cat cpu100/completed
2 0
estuary:/sys/kernel/debug/block/sda/hctx8$
estuary:/sys/kernel/debug/block/sda/hctx8$
estuary:/sys/kernel/debug/block/sda/hctx8$ cat state
SCHED_RESTART
estuary:/sys/kernel/debug/block/sda/hctx8$ ls
active cpu101 cpu96 cpu99 dispatch_busy io_poll sched_tags tags
busy cpu102 cpu97 ctx_map dispatched queued sched_tags_bitmap tags_bitmap
cpu100 cpu103 cpu98 dispatch flags run state type
estuary:/sys/kernel/debug/block/sda/hctx8$ cat dispatch
000000007abb596e {.op=FLUSH, .cmd_flags=PREFLUSH, 
.rq_flags=FLUSH_SEQ|MQ_INFLIGHT|DONTPREP, .state=idle, .tag=21, 
.internal_tag=-1, .cmd=opcode=0x35 35 00 00 00 00 00 00 00 00 00, 
.retries=0, .result = 0x0, .flags=TAGGED|INITIALIZED|3, .timeout=60.000, 
allocated 2208.876 s ago}
estuary:/sys/kernel/debug/block/sda/hctx8$

On cpu100, it seems completed is less than number dispatched.