On 9/21/21 12:40 AM, Dave Chinner wrote: > Hi Jens, > > I updated all my trees from 5.14 to 5.15-rc2 this morning and > immediately had problems running the recoveryloop fstest group on > them. These tests have a typical pattern of "run load in the > background, shutdown the filesystem, kill load, unmount and test > recovery". > > Whent eh load includes fsstress, and it gets killed after shutdown, > it hangs on exit like so: > > # echo w > /proc/sysrq-trigger > [ 370.669482] sysrq: Show Blocked State > [ 370.671732] task:fsstress state:D stack:11088 pid: 9619 ppid: 9615 flags:0x00000000 > [ 370.675870] Call Trace: > [ 370.677067] __schedule+0x310/0x9f0 > [ 370.678564] schedule+0x67/0xe0 > [ 370.679545] schedule_timeout+0x114/0x160 > [ 370.682002] __wait_for_common+0xc0/0x160 > [ 370.684274] wait_for_completion+0x24/0x30 > [ 370.685471] do_coredump+0x202/0x1150 > [ 370.690270] get_signal+0x4c2/0x900 > [ 370.691305] arch_do_signal_or_restart+0x106/0x7a0 > [ 370.693888] exit_to_user_mode_prepare+0xfb/0x1d0 > [ 370.695241] syscall_exit_to_user_mode+0x17/0x40 > [ 370.696572] do_syscall_64+0x42/0x80 > [ 370.697620] entry_SYSCALL_64_after_hwframe+0x44/0xae > > It's 100% reproducable on one of my test machines, but only one of > them. That one machine is running fstests on pmem, so it has > synchronous storage. Every other test machine using normal async > storage (nvme, iscsi, etc) and none of them are hanging. > > A quick troll of the commit history between 5.14 and 5.15-rc2 > indicates a couple of potential candidates. The 5th kernel build > (instead of ~16 for a bisect) told me that commit 15e20db2e0ce > ("io-wq: only exit on fatal signals") is the cause of the > regression. I've confirmed that this is the first commit where the > problem shows up. Thanks for the report Dave, I'll take a look. Can you elaborate on exactly what is being run? And when killed, it's a non-fatal signal? -- Jens Axboe