On 2021/01/25 8:48, Alexey Kardashevskiy wrote: > > > On 23/01/2021 22:17, Tetsuo Handa wrote: >> On 2021/01/23 12:27, Cong Wang wrote: >>> On Fri, Jan 22, 2021 at 4:42 PM Tetsuo Handa >>> <penguin-kernel@xxxxxxxxxxxxxxxxxxx> wrote: >>>> >>>> Hello, BPF developers. >>>> >>>> Alexey Kardashevskiy is reporting that system_wq gets stuck due to flooding of >>>> unbounded bpf_map_free_deferred work. Use of WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND >>>> workqueue did not solve this problem. Is it possible that a refcount leak somewhere >>>> preventing bpf_map_free_deferred from completing? Please see >>>> https://lkml.kernel.org/r/CACT4Y+Z+kwPM=WUzJ-e359PWeLLqmF0w4Yxp1spzZ=+J0ekrag@xxxxxxxxxxxxxx . >>>> >>> >>> Which map does the reproducer create? And where exactly do >>> those work block on? >>> >>> Different map->ops->map_free() waits for different reasons, >>> for example, htab_map_free() waits for flying htab_elem_free_rcu(). >>> I can't immediately see how they could wait for each other, if this >>> is what you meant above. I guess that rcu_barrier() in htab_map_free() is taking longer than it should, for Alexey said The offender is htab->lockdep_key. If I run repro at the slow rate, no problems appears, traces show lockdep_unregister_key() is called and the leak is quite slow. at https://lkml.kernel.org/r/c099ad52-0c2c-b886-bae2-c64bd8626452@xxxxxxxxx and pending works seems to be reduced by use of WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND workqueue at https://lkml.kernel.org/r/e4767b84-05a4-07c0-811b-b3a08cad2f43@xxxxxxxxx . >>> >>> Thanks. >>> >> >> Alexey, please try below diff on top of "show_workqueue_state() right after the bug message" >> and Hillf's patch. And please capture several times so that we can check if sched_show_task() >> always points to the same function. > > >> >> diff --git a/kernel/workqueue.c b/kernel/workqueue.c >> index 9880b6c0e272..616464dd8e92 100644 >> --- a/kernel/workqueue.c >> +++ b/kernel/workqueue.c >> @@ -50,6 +50,7 @@ >> #include <linux/uaccess.h> >> #include <linux/sched/isolation.h> >> #include <linux/nmi.h> >> +#include <linux/sched/debug.h> >> #include "workqueue_internal.h" >> @@ -4725,6 +4726,10 @@ static void show_pwq(struct pool_workqueue *pwq) >> comma = true; >> } >> pr_cont("\n"); >> + hash_for_each(pool->busy_hash, bkt, worker, hentry) { >> + if (worker->current_pwq == pwq) >> + sched_show_task(worker->task); >> + } >> } >> list_for_each_entry(work, &pool->worklist, entry) { >> > > Below are 3 samples, they report: > > Workqueue: events_unbound bpf_map_free_deferred > > Hope that helps, i'm still getting my head around this new can of worms :) Thanks, > > > > [ 148.770893] hrtimer: interrupt took 1070385 ns > [ 207.882594] BUG: MAX_LOCKDEP_KEYS too low! > [ 207.886689] turning off the locking correctness validator. > [ 207.886766] CPU: 1 PID: 9448 Comm: maxlockdep Not tainted 5.11.0-rc4-le_syzkaller_a+fstn1 #64 > [ 207.886882] Call Trace: > [ 207.886914] [c00000003b397690] [c000000000d112f0] dump_stack+0x13c/0x1bc (unreliable) > [ 207.887052] [c00000003b3976e0] [c0000000002edb0c] register_lock_class+0xcbc/0xe20 > [ 207.887156] [c00000003b3977f0] [c0000000002e97b8] __lock_acquire+0xa8/0x21e0 > [ 207.887247] [c00000003b397940] [c0000000002ec674] lock_acquire+0x2c4/0x5c0 > [ 207.887328] [c00000003b397a30] [c0000000018212dc] _raw_spin_lock_irqsave+0x7c/0xb0 > [ 207.887436] [c00000003b397a70] [c0000000004dc9d0] htab_lru_map_update_elem+0x3e0/0x6c0 > [ 207.887537] [c00000003b397af0] [c0000000004a2b14] bpf_map_update_value.isra.24+0x734/0x820 > [ 207.887631] [c00000003b397b50] [c0000000004a7d58] generic_map_update_batch+0x1b8/0x3b0 > [ 207.887723] [c00000003b397bf0] [c0000000004a05e0] bpf_map_do_batch+0x1b0/0x390 > [ 207.887816] [c00000003b397c50] [c0000000004aca90] __do_sys_bpf+0x13d0/0x35b0 > [ 207.887905] [c00000003b397db0] [c00000000004b648] system_call_exception+0x178/0x2b0 > [ 207.887998] [c00000003b397e10] [c00000000000e060] system_call_common+0xf0/0x27c > [ 207.888090] Showing busy workqueues and worker pools: > [ 207.888149] workqueue events_unbound: flags=0x2 > [ 207.888206] pwq 16: cpus=0-7 flags=0x4 nice=0 active=1/512 refcnt=3 > [ 207.888290] in-flight: 7:bpf_map_free_deferred > [ 207.888357] ___K___ (1) show_pwq 4729 > [ 207.888403] task:kworker/u16:0 state:D stack:10928 pid: 7 ppid: 2 flags:0x00000800 > [ 207.888494] Workqueue: events_unbound bpf_map_free_deferred > [ 207.888559] Call Trace: > [ 207.888590] [c00000001343b4f0] [c000000013412b18] 0xc000000013412b18 (unreliable) > [ 207.888691] [c00000001343b6e0] [c0000000000271d0] __switch_to+0x3e0/0x700 > [ 207.888763] [c00000001343b750] [c000000001817b48] __schedule+0x3c8/0xc80 > [ 207.888828] [c00000001343b820] [c000000001818494] schedule+0x94/0x170 > [ 207.888892] [c00000001343b850] [c00000000181f9ec] schedule_timeout+0x43c/0x650 > [ 207.888966] [c00000001343b960] [c00000000181a194] wait_for_completion+0xb4/0x190 > [ 207.889041] [c00000001343b9c0] [c000000000325904] __wait_rcu_gp+0x1c4/0x240 > [ 207.889105] [c00000001343ba20] [c000000000339164] synchronize_rcu+0xa4/0x180 > [ 207.889179] [c00000001343bac0] [c0000000002e54b0] lockdep_unregister_key+0x1e0/0x470 > [ 207.889253] [c00000001343bb60] [c0000000004d942c] htab_map_free+0x20c/0x250 > [ 207.889317] [c00000001343bbc0] [c0000000004a1934] bpf_map_free_deferred+0xa4/0x3e0 > [ 207.889391] [c00000001343bc30] [c00000000026d508] process_one_work+0x468/0xb00 > [ 207.889465] [c00000001343bd10] [c00000000026dc34] worker_thread+0x94/0x760 > [ 207.889535] [c00000001343bda0] [c00000000027e380] kthread+0x1f0/0x200 > [ 207.889598] [c00000001343be10] [c00000000000e2f0] ret_from_kernel_thread+0x5c/0x6c > [ 207.889674] workqueue events_power_efficient: flags=0x80 > [ 207.889721] pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=1/256 refcnt=2 > [ 207.889789] pending: gc_worker > [ 207.889894] pool 16: cpus=0-7 flags=0x4 nice=0 hung=76s workers=3 idle: 81 253 > > > > > [ 29.071632] maxlockdep (2535) used greatest stack depth: 7872 bytes left > [ 30.855628] hrtimer: interrupt took 1041204 ns > [ 96.748271] BUG: MAX_LOCKDEP_KEYS too low! > [ 96.780332] turning off the locking correctness validator. > [ 96.781239] CPU: 0 PID: 9443 Comm: maxlockdep Not tainted 5.11.0-rc4-le_syzkaller_a+fstn1 #64 > [ 96.782008] Call Trace: > [ 96.782571] [c00000003658b690] [c000000000d112f0] dump_stack+0x13c/0x1bc (unreliable) > [ 96.783472] [c00000003658b6e0] [c0000000002edb0c] register_lock_class+0xcbc/0xe20 > [ 96.784037] [c00000003658b7f0] [c0000000002e97b8] __lock_acquire+0xa8/0x21e0 > [ 96.784607] [c00000003658b940] [c0000000002ec674] lock_acquire+0x2c4/0x5c0 > [ 96.785067] [c00000003658ba30] [c0000000018212dc] _raw_spin_lock_irqsave+0x7c/0xb0 > [ 96.785307] [c00000003658ba70] [c0000000004dc9d0] htab_lru_map_update_elem+0x3e0/0x6c0 > [ 96.785541] [c00000003658baf0] [c0000000004a2b14] bpf_map_update_value.isra.24+0x734/0x820 > [ 96.785749] [c00000003658bb50] [c0000000004a7d58] generic_map_update_batch+0x1b8/0x3b0 > [ 96.785963] [c00000003658bbf0] [c0000000004a05e0] bpf_map_do_batch+0x1b0/0x390 > [ 96.786161] [c00000003658bc50] [c0000000004aca90] __do_sys_bpf+0x13d0/0x35b0 > [ 96.786392] [c00000003658bdb0] [c00000000004b648] system_call_exception+0x178/0x2b0 > [ 96.786614] [c00000003658be10] [c00000000000e060] system_call_common+0xf0/0x27c > [ 96.786819] Showing busy workqueues and worker pools: > [ 96.786942] workqueue events_unbound: flags=0x2 > [ 96.787062] pwq 16: cpus=0-7 flags=0x4 nice=0 active=1/512 refcnt=3 > [ 96.787272] in-flight: 81:bpf_map_free_deferred > [ 96.787441] ___K___ (0) show_pwq 4729 > [ 96.787538] task:kworker/u16:1 state:D stack:10928 pid: 81 ppid: 2 flags:0x00000800 > [ 96.787767] Workqueue: events_unbound bpf_map_free_deferred > [ 96.787909] Call Trace: > [ 96.787975] [c000000013b57600] [c000000013b57660] 0xc000000013b57660 (unreliable) > [ 96.788222] [c000000013b577f0] [c0000000000271d0] __switch_to+0x3e0/0x700 > [ 96.788417] [c000000013b57860] [c000000001817b48] __schedule+0x3c8/0xc80 > [ 96.788595] [c000000013b57930] [c000000001818494] schedule+0x94/0x170 > [ 96.788772] [c000000013b57960] [c00000000181f9ec] schedule_timeout+0x43c/0x650 > [ 96.789002] [c000000013b57a70] [c00000000181a194] wait_for_completion+0xb4/0x190 > [ 96.789222] [c000000013b57ad0] [c000000000333e8c] rcu_barrier+0x2fc/0xc70 > [ 96.789402] [c000000013b57b60] [c0000000004d9258] htab_map_free+0x38/0x250 > [ 96.789580] [c000000013b57bc0] [c0000000004a1934] bpf_map_free_deferred+0xa4/0x3e0 > [ 96.789804] [c000000013b57c30] [c00000000026d508] process_one_work+0x468/0xb00 > [ 96.790002] [c000000013b57d10] [c00000000026dc34] worker_thread+0x94/0x760 > [ 96.790177] [c000000013b57da0] [c00000000027e380] kthread+0x1f0/0x200 > [ 96.790346] [c000000013b57e10] [c00000000000e2f0] ret_from_kernel_thread+0x5c/0x6c > [ 96.790559] workqueue events_power_efficient: flags=0x80 > [ 96.790717] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2 > [ 96.790956] pending: gc_worker > [ 96.791215] pool 16: cpus=0-7 flags=0x4 nice=0 hung=76s workers=3 idle: 7 251 > > > > > > [ 21.324859] hrtimer: interrupt took 383633 ns > [ 83.653616] BUG: MAX_LOCKDEP_KEYS too low! > [ 83.653806] turning off the locking correctness validator. > [ 83.653963] CPU: 4 PID: 9460 Comm: maxlockdep Not tainted 5.11.0-rc4-le_syzkaller_a+fstn1 #64 > [ 83.654183] Call Trace: > [ 83.654250] [c00000004d53b690] [c000000000d112f0] dump_stack+0x13c/0x1bc (unreliable) > [ 83.654565] [c00000004d53b6e0] [c0000000002edb0c] register_lock_class+0xcbc/0xe20 > [ 83.654808] [c00000004d53b7f0] [c0000000002e97b8] __lock_acquire+0xa8/0x21e0 > [ 83.655020] [c00000004d53b940] [c0000000002ec674] lock_acquire+0x2c4/0x5c0 > [ 83.655203] [c00000004d53ba30] [c0000000018212dc] _raw_spin_lock_irqsave+0x7c/0xb0 > [ 83.655444] [c00000004d53ba70] [c0000000004dc9d0] htab_lru_map_update_elem+0x3e0/0x6c0 > [ 83.655673] [c00000004d53baf0] [c0000000004a2b14] bpf_map_update_value.isra.24+0x734/0x820 > [ 83.655886] [c00000004d53bb50] [c0000000004a7d58] generic_map_update_batch+0x1b8/0x3b0 > [ 83.656088] [c00000004d53bbf0] [c0000000004a05e0] bpf_map_do_batch+0x1b0/0x390 > [ 83.656297] [c00000004d53bc50] [c0000000004aca90] __do_sys_bpf+0x13d0/0x35b0 > [ 83.656496] [c00000004d53bdb0] [c00000000004b648] system_call_exception+0x178/0x2b0 > [ 83.656700] [c00000004d53be10] [c00000000000e060] system_call_common+0xf0/0x27c > [ 83.656903] Showing busy workqueues and worker pools: > [ 83.657044] workqueue events_unbound: flags=0x2 > [ 83.657165] pwq 16: cpus=0-7 flags=0x4 nice=0 active=1/512 refcnt=3 > [ 83.657358] in-flight: 81:bpf_map_free_deferred > [ 83.657519] ___K___ (4) show_pwq 4729 > [ 83.657620] task:kworker/u16:1 state:D stack:10928 pid: 81 ppid: 2 flags:0x00000800 > [ 83.657820] Workqueue: events_unbound bpf_map_free_deferred > [ 83.657967] Call Trace: > [ 83.658033] [c000000013b574f0] [c000000013aefa98] 0xc000000013aefa98 (unreliable) > [ 83.658255] [c000000013b576e0] [c0000000000271d0] __switch_to+0x3e0/0x700 > [ 83.658437] [c000000013b57750] [c000000001817b48] __schedule+0x3c8/0xc80 > [ 83.658618] [c000000013b57820] [c000000001818494] schedule+0x94/0x170 > [ 83.658815] [c000000013b57850] [c00000000181f9ec] schedule_timeout+0x43c/0x650 > [ 83.659019] [c000000013b57960] [c00000000181a194] wait_for_completion+0xb4/0x190 > [ 83.659229] [c000000013b579c0] [c000000000325904] __wait_rcu_gp+0x1c4/0x240 > [ 83.659406] [c000000013b57a20] [c000000000339164] synchronize_rcu+0xa4/0x180 > [ 83.659617] [c000000013b57ac0] [c0000000002e54b0] lockdep_unregister_key+0x1e0/0x470 > [ 83.659825] [c000000013b57b60] [c0000000004d942c] htab_map_free+0x20c/0x250 > [ 83.660002] [c000000013b57bc0] [c0000000004a1934] bpf_map_free_deferred+0xa4/0x3e0 > [ 83.660212] [c000000013b57c30] [c00000000026d508] process_one_work+0x468/0xb00 > [ 83.660415] [c000000013b57d10] [c00000000026dc34] worker_thread+0x94/0x760 > [ 83.660598] [c000000013b57da0] [c00000000027e380] kthread+0x1f0/0x200 > [ 83.660771] [c000000013b57e10] [c00000000000e2f0] ret_from_kernel_thread+0x5c/0x6c > [ 83.661114] pool 16: cpus=0-7 flags=0x4 nice=0 hung=67s workers=3 idle: 7 252 > > > >