Hi all,

I could use some help sorting out the following fallout, which a recent addition to our test suite triggers consistently on a range of recent upstream kernel versions. It hits with "debug_defconfig" on ARCH=s390, while "defconfig" runs clean. CONFIG_WQ_WATCHDOG is set only in "debug_defconfig", but that option alone is not sufficient: the BUG does not show up with just CONFIG_WQ_WATCHDOG added to "defconfig", so there may be an interaction with other config options in "debug_defconfig". I wonder whether other architectures see similar issues under this kind of stress? At any rate, I'm open to suggestions on how to attack this issue, and also to any advice on whether the combination of test case and additional CONFIG options in "debug_defconfig" is simply abusive...

For a clean baseline I just re-created this on a 6.3.0 kernel with the following workload:

  fio --readonly pci_nvme_read.fiojob

with pci_nvme_read.fiojob specifying:

  [iopstest]
  filename=/dev/nvme0n1
  rw=read
  size=512m
  blocksize=4k
  iodepth=32
  numjobs=14        (and higher, up to $nproc=18)
  time_based
  runtime=1m
  ramp_time=10s
  ioengine=io_uring (also with libaio)
  direct=1
  norandommap
  group_reporting=1

I get the following messages logged (captured on t35lp67, 2023/05/04 19:31):

[ 235.438022] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 65s!
[ 235.438265] BUG: workqueue lockup - pool cpus=7 node=0 flags=0x0 nice=0 stuck for 61s!
[ 235.438278] BUG: workqueue lockup - pool cpus=10 node=0 flags=0x0 nice=0 stuck for 46s!
[ 235.438290] BUG: workqueue lockup - pool cpus=10 node=0 flags=0x0 nice=-20 stuck for 41s!
[ 235.438438] Showing busy workqueues and worker pools:
[ 235.438493] workqueue events: flags=0x0
[ 235.438621]   pwq 20: cpus=10 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 235.438624]     pending: kfree_rcu_monitor
[ 235.438634]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256 refcnt=3
[ 235.438636]     pending: vmstat_shepherd, kfree_rcu_monitor
[ 235.438736] workqueue events_unbound: flags=0x2
[ 235.438752]   pwq 760: cpus=0-379 flags=0x4 nice=0 active=1/1520 refcnt=3
[ 235.438754]     in-flight: 588:toggle_allocation_gate
[ 235.438944] workqueue rcu_gp: flags=0x8
[ 235.439069]   pwq 20: cpus=10 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 235.439072]     in-flight: 259:wait_rcu_exp_gp
[ 235.439125] workqueue rcu_par_gp: flags=0x8
[ 235.439251]   pwq 20: cpus=10 node=0 flags=0x0 nice=0 active=2/256 refcnt=4
[ 235.439253]     in-flight: 115:sync_rcu_exp_select_node_cpus BAR(259)
[ 235.439255]     pending: sync_rcu_exp_select_node_cpus
[ 235.439413] workqueue mm_percpu_wq: flags=0x8
[ 235.439542]   pwq 14: cpus=7 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 235.439544]     pending: lru_add_drain_per_cpu
[ 235.439550]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=3
[ 235.439552]     pending: lru_add_drain_per_cpu BAR(140)
[ 235.439817] workqueue writeback: flags=0x4a
[ 235.439969]   pwq 760: cpus=0-379 flags=0x4 nice=0 active=1/256 refcnt=3
[ 235.439971]     in-flight: 1345:wb_workfn
[ 235.440110] workqueue kblockd: flags=0x18
[ 235.440238]   pwq 21: cpus=10 node=0 flags=0x0 nice=-20 active=1/256 refcnt=2
[ 235.440241]     pending: blk_mq_run_work_fn
[ 235.441046] workqueue mld: flags=0x40008
[ 235.441173]   pwq 20: cpus=10 node=0 flags=0x0 nice=0 active=1/1 refcnt=2
[ 235.441175]     pending: mld_ifc_work
[ 235.441603] workqueue mlx5e: flags=0xe000a
[ 235.441611]   pwq 760: cpus=0-379 flags=0x4 nice=0 active=1/1 refcnt=4
[ 235.441613]     in-flight: 12:mlx5e_set_rx_mode_work [mlx5_core]
[ 235.441741]     inactive: mlx5e_set_rx_mode_work [mlx5_core]
[ 235.443027] pool 20: cpus=10 node=0 flags=0x0 nice=0 hung=46s workers=3 idle: 69
[ 235.443339] pool 760: cpus=0-379 flags=0x4 nice=0 hung=0s workers=19 idle: 13 568 10 586 1987 567 587 1988 147 584 1321 1353 1986 1989 1990 1991

Thank you,
Gerd
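P.S.: in case it helps with reproducing, here is a sketch of the driver script I use for the numjobs sweep. It only writes one job file per numjobs value and prints the fio invocations rather than running them; the device path and the 14..18 range are from my setup (nproc=18 on that box), and the per-numjobs file naming is just for this sketch, so adjust DEV and the range for yours:

```shell
#!/bin/sh
# Generate one fio job file per numjobs value in the sweep and print
# the corresponding fio command lines. DEV defaults to the NVMe device
# from my setup; override it in the environment for a different box.
DEV=${DEV:-/dev/nvme0n1}

for jobs in 14 15 16 17 18; do
    job=pci_nvme_read-$jobs.fiojob
    cat > "$job" <<EOF
[iopstest]
filename=$DEV
rw=read
size=512m
blocksize=4k
iodepth=32
numjobs=$jobs
time_based
runtime=1m
ramp_time=10s
ioengine=io_uring
direct=1
norandommap
group_reporting=1
EOF
    # Print the invocation instead of running it, so the sweep can be
    # driven (and dmesg watched for lockup messages) by hand.
    echo "fio --readonly $job"
done
```

Each printed command is then run as root on an otherwise idle machine while watching the console for the workqueue lockup messages; swapping ioengine=io_uring for libaio in the heredoc reproduces the libaio variant.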