[linus:master] [torture] 5d248bb39f: BUG:workqueue_lockup-pool

Hi Dietmar Eggemann,

We don't have enough knowledge to explain the issue below; we are just reporting what we observed in our tests, FYI.

Since the failure is random, we ran the jobs 999 times each for both 5d248bb39f and its parent:

67d5404d27437689 5d248bb39fe1388943acb6510f8
---------------- ---------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
           :999          6%          56:999   dmesg.BUG:workqueue_lockup-pool   <--------
           :999          0%           1:999   dmesg.INFO:rcu_preempt_detected_stalls_on_CPUs/tasks
           :999          0%           1:999   dmesg.INFO:rcu_tasks_detected_stalls_on_tasks
          2:999          2%          19:999   dmesg.INFO:task_blocked_for_more_than#seconds
          8:999         -1%            :999   dmesg.IP-Config:Auto-configuration_of_network_failed
          2:999          2%          19:999   dmesg.Kernel_panic-not_syncing:hung_task:blocked_tasks
           :999          0%           2:999   dmesg.RIP:__might_resched
           :999          0%           1:999   dmesg.RIP:debug_lockdep_rcu_enabled
           :999          0%           2:999   dmesg.RIP:ftrace_likely_update
          1:999         -0%            :999   dmesg.RIP:kasan_check_range
           :999          0%           1:999   dmesg.RIP:kfree_rcu_monitor
           :999          0%           1:999   dmesg.RIP:lock_acquire
           :999          0%           2:999   dmesg.RIP:lock_is_held_type
           :999          0%           1:999   dmesg.RIP:process_one_work
          1:999         -0%            :999   dmesg.WARNING:inconsistent_lock_state
          1:999         -0%            :999   dmesg.calltrace:irq_exit_rcu
          1:999         -0%            :999   dmesg.inconsistent{IN-HARDIRQ-W}->{HARDIRQ-ON-W}usage

As shown above, both commits hit various issues at random. However, dmesg.BUG:workqueue_lockup-pool was observed only on 5d248bb39f, at a low rate (56 out of 999 runs), and never on the parent.
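
For reference, here is how we read the %reproduction column (a minimal sketch of our interpretation, not necessarily the exact formula used by the lkp-tests tooling): it appears to be the failure-rate difference between the tested commit and its parent, rounded to a whole percent.

# Minimal sketch (our reading of the table above, not the exact lkp-tests formula):
# "%reproduction" appears to be commit failure rate minus parent failure rate,
# in percent, rounded to the nearest whole number.

def reproduction_delta(parent_fails, commit_fails, runs=999):
    """Failure-rate difference in percent, commit minus parent."""
    return (commit_fails - parent_fails) / runs * 100

# dmesg.BUG:workqueue_lockup-pool: 0/999 on parent, 56/999 on 5d248bb39f
print(round(reproduction_delta(0, 56)))   # -> 6, matching the 6% row above

# dmesg.INFO:task_blocked_for_more_than#seconds: 2/999 vs 19/999
print(round(reproduction_delta(2, 19)))   # -> 2, matching the 2% row above

# dmesg.IP-Config:Auto-configuration_of_network_failed: 8/999 vs 0/999
print(round(reproduction_delta(8, 0)))    # -> -1, matching the -1% row above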

More details follow below.

Hello,

kernel test robot noticed "BUG:workqueue_lockup-pool" on:

commit: 5d248bb39fe1388943acb6510f8f48fa5570e0ec ("torture: Add lock_torture writer_fifo module parameter")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

[test failed on linus/master 6465e260f48790807eef06b583b38ca9789b6072]
[test failed on linux-next/master 4ae73bba62a367f2314f6ce69e3085a941983d8b]

in testcase: boot

compiler: gcc-9
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to the attached dmesg/kmsg for the entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags:
| Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
| Closes: https://lore.kernel.org/oe-lkp/202310101456.53664ae-oliver.sang@xxxxxxxxx


[  456.238517][    C0] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 34s!
[  456.263857][    C0] Showing busy workqueues and worker pools:
[  456.283499][    C0] workqueue events_power_efficient: flags=0x80
[  456.299402][    C0]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  456.302891][    C0]     pending: check_lifetime
[  456.331425][    C0] Showing backtraces of running workers in stalled CPU-bound worker pools:
[  464.399159][    C0] hrtimer: interrupt took 9163921 ns
[  486.958477][    C0] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 65s!
[  486.974447][    C0] Showing busy workqueues and worker pools:
[  486.984953][    C0] workqueue events_power_efficient: flags=0x80
[  486.996029][    C0]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  486.996193][    C0]     pending: check_lifetime
[  486.996710][    C0] Showing backtraces of running workers in stalled CPU-bound worker pools:
[  517.678504][    C0] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 96s!
[  517.694928][    C0] Showing busy workqueues and worker pools:
[  517.706364][    C0] workqueue events_power_efficient: flags=0x80
[  517.716928][    C0]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  517.717091][    C0]     pending: check_lifetime
[  517.717615][    C0] Showing backtraces of running workers in stalled CPU-bound worker pools:
[  548.398534][    C0] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 126s!
[  548.415903][    C0] Showing busy workqueues and worker pools:
[  548.445877][    C0] workqueue events_power_efficient: flags=0x80
[  548.472451][    C0]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  548.472645][    C0]     pending: check_lifetime
[  548.473106][    C0] Showing backtraces of running workers in stalled CPU-bound worker pools:
[  579.118467][    C0] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 157s!
[  579.227632][    C0] Showing busy workqueues and worker pools:
[  579.238527][    C0] workqueue events_power_efficient: flags=0x80
[  579.249190][    C0]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  579.249353][    C0]     pending: check_lifetime
[  579.249896][    C0] Showing backtraces of running workers in stalled CPU-bound worker pools:
[  609.838492][    C0] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 188s!
[  609.854400][    C0] Showing busy workqueues and worker pools:
[  609.869467][    C0] workqueue events_power_efficient: flags=0x80
[  609.884664][    C0]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  609.884855][    C0]     pending: check_lifetime
[  609.885412][    C0] Showing backtraces of running workers in stalled CPU-bound worker pools:
[  640.558322][    C0] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 219s!
[  640.573488][    C0] Showing busy workqueues and worker pools:
[  640.584383][    C0] workqueue events_power_efficient: flags=0x80
[  640.596724][    C0]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  640.596891][    C0]     pending: check_lifetime
[  640.597439][    C0] Showing backtraces of running workers in stalled CPU-bound worker pools:
[  671.278472][    C0] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 249s!
[  671.305078][    C0] Showing busy workqueues and worker pools:
[  671.317258][    C0] workqueue events_power_efficient: flags=0x80
[  671.333460][    C0]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  671.333624][    C0]     pending: check_lifetime
[  671.357925][    C0] Showing backtraces of running workers in stalled CPU-bound worker pools:
[  701.998584][    C0] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 280s!
[  702.019680][    C0] Showing busy workqueues and worker pools:
[  702.030924][    C0] workqueue events_power_efficient: flags=0x80
[  702.042393][    C0]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  702.043337][    C0]     pending: check_lifetime
[  702.044403][    C0] Showing backtraces of running workers in stalled CPU-bound worker pools:
[  732.718468][    C0] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 311s!
[  732.741426][    C0] Showing busy workqueues and worker pools:
[  732.758915][    C0] workqueue events_power_efficient: flags=0x80
[  732.768737][    C0]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  732.768923][    C0]     pending: check_lifetime
[  732.806456][    C0] Showing backtraces of running workers in stalled CPU-bound worker pools:
[  763.438485][    C0] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 342s!
[  763.461942][    C0] Showing busy workqueues and worker pools:
[  763.472996][    C0] workqueue events_power_efficient: flags=0x80
[  763.484138][    C0]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  763.484304][    C0]     pending: check_lifetime
[  763.484727][    C0] Showing backtraces of running workers in stalled CPU-bound worker pools:
[  794.158571][    C0] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 372s!
[  794.174418][    C0] Showing busy workqueues and worker pools:
[  794.188676][    C0] workqueue events_power_efficient: flags=0x80
[  794.199823][    C0]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  794.199990][    C0]     pending: check_lifetime
[  794.200477][    C0] Showing backtraces of running workers in stalled CPU-bound worker pools:
[  824.878523][    C0] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 403s!
[  824.898973][    C0] Showing busy workqueues and worker pools:
[  824.915273][    C0] workqueue events_power_efficient: flags=0x80
[  824.926393][    C0]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  824.926581][    C0]     pending: check_lifetime
[  824.927107][    C0] Showing backtraces of running workers in stalled CPU-bound worker pools:
[  855.598509][    C0] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 434s!
[  855.628235][    C0] Showing busy workqueues and worker pools:
[  855.638889][    C0] workqueue events_power_efficient: flags=0x80
[  855.651558][    C0]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  855.651727][    C0]     pending: check_lifetime
[  855.652271][    C0] Showing backtraces of running workers in stalled CPU-bound worker pools:
[  886.318451][    C0] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 464s!
[  886.343400][    C0] Showing busy workqueues and worker pools:
[  886.353541][    C0] workqueue events_power_efficient: flags=0x80
[  886.372345][    C0]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  886.372510][    C0]     pending: check_lifetime
[  886.373057][    C0] Showing backtraces of running workers in stalled CPU-bound worker pools:
[  917.038624][    C0] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 495s!
[  917.061144][    C0] Showing busy workqueues and worker pools:
[  917.072203][    C0] workqueue events_power_efficient: flags=0x80
[  917.094365][    C0]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  917.094591][    C0]     pending: check_lifetime
[  917.095106][    C0] Showing backtraces of running workers in stalled CPU-bound worker pools:
[  947.758605][    C0] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 526s!
[  947.794471][    C0] Showing busy workqueues and worker pools:
[  947.806830][    C0] workqueue events_power_efficient: flags=0x80
[  947.819558][    C0]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  947.819811][    C0]     pending: check_lifetime
[  947.820305][    C0] Showing backtraces of running workers in stalled CPU-bound worker pools:
[  978.478528][    C0] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 557s!
[  978.503662][    C0] Showing busy workqueues and worker pools:
[  978.514379][    C0] workqueue events_power_efficient: flags=0x80
[  978.525533][    C0]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  978.525723][    C0]     pending: check_lifetime
[  978.548149][    C0] Showing backtraces of running workers in stalled CPU-bound worker pools:
[ 1009.198483][    C0] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 587s!
[ 1009.221412][    C0] Showing busy workqueues and worker pools:
[ 1009.237792][    C0] workqueue events_power_efficient: flags=0x80
[ 1009.269282][    C0]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1009.269477][    C0]     pending: check_lifetime
[ 1009.290321][    C0] Showing backtraces of running workers in stalled CPU-bound worker pools:
[ 1029.678914][   T30] INFO: task swapper/0:1 blocked for more than 491 seconds.
[ 1029.699579][   T30]       Not tainted 6.5.0-rc1-00004-g5d248bb39fe1 #1
[ 1029.718868][   T30] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1029.740371][   T30] task:swapper/0       state:D stack:0     pid:1     ppid:0      flags:0x00004000
[ 1029.767948][   T30] Call Trace:
[ 1029.773801][   T30]  <TASK>
[ 1029.778983][ T30] __schedule (kernel/sched/core.c:5381 kernel/sched/core.c:6710) 
[ 1029.791567][ T30] ? io_schedule_timeout (kernel/sched/core.c:6592) 
[ 1029.800242][ T30] ? __lock_acquire (kernel/locking/lockdep.c:5144) 
[ 1029.808382][ T30] schedule (kernel/sched/core.c:6787 (discriminator 1)) 
[ 1029.815057][ T30] schedule_timeout (kernel/time/timer.c:2144) 
[ 1029.822775][ T30] ? mark_lock (arch/x86/include/asm/bitops.h:228 (discriminator 3) arch/x86/include/asm/bitops.h:240 (discriminator 3) include/asm-generic/bitops/instrumented-non-atomic.h:142 (discriminator 3) kernel/locking/lockdep.c:228 (discriminator 3) kernel/locking/lockdep.c:4663 (discriminator 3)) 
[ 1029.833948][ T30] ? usleep_range_state (kernel/time/timer.c:2129) 
[ 1029.842598][ T30] ? check_preemption_disabled (lib/smp_processor_id.c:16) 
[ 1029.851647][ T30] ? _raw_spin_unlock_irq (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:77 include/linux/spinlock_api_smp.h:159 kernel/locking/spinlock.c:202) 
[ 1029.860296][ T30] ? wait_for_completion (kernel/sched/completion.c:85 kernel/sched/completion.c:106 kernel/sched/completion.c:117 kernel/sched/completion.c:138) 
[ 1029.869302][ T30] wait_for_completion (kernel/sched/completion.c:86 kernel/sched/completion.c:106 kernel/sched/completion.c:117 kernel/sched/completion.c:138) 
[ 1029.877507][ T30] ? wait_for_completion_io (kernel/sched/completion.c:137) 
[ 1029.886807][ T30] __cpuhp_kick_ap (kernel/cpu.c:737) 
[ 1029.894385][ T30] cpuhp_issue_call (kernel/cpu.c:1123 kernel/cpu.c:2324) 
[ 1029.907022][ T30] __cpuhp_setup_state_cpuslocked (kernel/cpu.c:2473) 
[ 1029.917563][ T30] ? rcu_read_lock_held (kernel/rcu/update.c:380) 
[ 1029.926533][ T30] ? xz_dec_bcj_reset (lib/percpu_counter.c:201) 
[ 1029.934476][ T30] __cpuhp_setup_state (include/linux/percpu-rwsem.h:99 kernel/cpu.c:501 kernel/cpu.c:2503) 
[ 1029.942992][ T30] ? libcrc32c_mod_init (lib/percpu_counter.c:258) 
[ 1029.951358][ T30] percpu_counter_startup (lib/percpu_counter.c:263) 
[ 1029.959755][ T30] do_one_initcall (init/main.c:1232) 
[ 1029.967660][ T30] ? trace_event_raw_event_initcall_level (init/main.c:1223) 
[ 1029.978323][ T30] ? parameq (kernel/params.c:171) 
[ 1029.985581][ T30] ? check_preemption_disabled (lib/smp_processor_id.c:16) 
[ 1029.994714][ T30] ? check_preemption_disabled (lib/smp_processor_id.c:16) 
[ 1030.004702][ T30] kernel_init_freeable (init/main.c:1293 init/main.c:1310 init/main.c:1329 init/main.c:1546) 
[ 1030.013031][ T30] ? rest_init (init/main.c:1429) 
[ 1030.020898][ T30] kernel_init (init/main.c:1439) 
[ 1030.027964][ T30] ? rest_init (init/main.c:1429) 
[ 1030.035129][ T30] ret_from_fork (arch/x86/entry/entry_64.S:314) 
[ 1030.047506][   T30]  </TASK>
[ 1030.052818][   T30]
[ 1030.052818][   T30] Showing all locks held in the system:
[ 1030.065360][   T30] 2 locks held by swapper/0/1:
[ 1030.073338][ T30] #0: ffffffff84b35190 (cpu_hotplug_lock){++++}-{0:0}, at: percpu_counter_startup (lib/percpu_counter.c:263) 
[ 1030.091412][ T30] #1: ffffffff84b37848 (cpuhp_state_mutex){+.+.}-{3:3}, at: __cpuhp_setup_state_cpuslocked (kernel/cpu.c:2278 kernel/cpu.c:2449) 
[ 1030.109624][   T30] 1 lock held by rcu_tasks_kthre/11:
[ 1030.117928][ T30] #0: ffffffff84f38e50 (rcu_tasks.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp (kernel/rcu/tasks.h:525) 
[ 1030.134843][   T30] 1 lock held by rcu_tasks_trace/12:
[ 1030.143010][ T30] #0: ffffffff84f38b50 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp (kernel/rcu/tasks.h:525) 
[ 1030.163969][   T30] 1 lock held by khungtaskd/30:
[ 1030.180109][ T30] #0: ffffffff84f39aa0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks (kernel/locking/lockdep.c:6615) 
[ 1030.200232][   T30]
[ 1030.204126][   T30] =============================================
[ 1030.204126][   T30]
[ 1030.217675][   T30] Kernel panic - not syncing: hung_task: blocked tasks
[ 1030.221472][   T30] CPU: 0 PID: 30 Comm: khungtaskd Not tainted 6.5.0-rc1-00004-g5d248bb39fe1 #1
[ 1030.221472][   T30] Call Trace:
[ 1030.221472][   T30]  <TASK>
[ 1030.221472][ T30] dump_stack_lvl (lib/dump_stack.c:107) 
[ 1030.221472][ T30] dump_stack (lib/dump_stack.c:114) 
[ 1030.221472][ T30] panic (kernel/panic.c:340) 
[ 1030.221472][ T30] ? panic_smp_self_stop+0x100/0x100 
[ 1030.221472][ T30] watchdog (kernel/hung_task.c:195 kernel/hung_task.c:379) 
[ 1030.221472][ T30] kthread (kernel/kthread.c:389) 
[ 1030.221472][ T30] ? calculate_sigpending (kernel/signal.c:200) 
[ 1030.221472][ T30] ? proc_dohung_task_timeout_secs (kernel/hung_task.c:362) 
[ 1030.221472][ T30] ? kthread_complete_and_exit (kernel/kthread.c:342) 
[ 1030.221472][ T30] ret_from_fork (arch/x86/entry/entry_64.S:314) 
[ 1030.221472][   T30]  </TASK>
[ 1030.221472][   T30] Kernel Offset: disabled



The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231010/202310101456.53664ae-oliver.sang@xxxxxxxxx



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



