+cc tglx, peterz for insight on CPU hot plug On Sat, Jan 04, 2025 at 01:00:17PM +0900, Koichiro Den wrote: > On Fri, Jan 03, 2025 at 11:33:19PM +0000, Lorenzo Stoakes wrote: > > On Sat, Dec 21, 2024 at 12:33:20PM +0900, Koichiro Den wrote: > > > Even after mm/vmstat:online teardown, shepherd may still queue work for > > > the dying cpu until the cpu is removed from online mask. While it's > > > quite rare, this means that after unbind_workers() unbinds a per-cpu > > > kworker, it potentially runs vmstat_update for the dying CPU on an > > > irrelevant cpu before entering atomic AP states. > > > When CONFIG_DEBUG_PREEMPT=y, it results in the following error with the > > > backtrace. > > > > > > BUG: using smp_processor_id() in preemptible [00000000] code: \ > > > kworker/7:3/1702 > > > caller is refresh_cpu_vm_stats+0x235/0x5f0 > > > CPU: 0 UID: 0 PID: 1702 Comm: kworker/7:3 Tainted: G > > > Tainted: [N]=TEST > > > Workqueue: mm_percpu_wq vmstat_update > > > Call Trace: > > > <TASK> > > > dump_stack_lvl+0x8d/0xb0 > > > check_preemption_disabled+0xce/0xe0 > > > refresh_cpu_vm_stats+0x235/0x5f0 > > > vmstat_update+0x17/0xa0 > > > process_one_work+0x869/0x1aa0 > > > worker_thread+0x5e5/0x1100 > > > kthread+0x29e/0x380 > > > ret_from_fork+0x2d/0x70 > > > ret_from_fork_asm+0x1a/0x30 > > > </TASK> > > > > > > So, for mm/vmstat:online, disable vmstat_work reliably on teardown and > > > symmetrically enable it on startup. > > > > > > Signed-off-by: Koichiro Den <koichiro.den@xxxxxxxxxxxxx> > > > > Hi, > > > > I observed a warning in my qemu and real hardware, which I bisected to this commit: > > > > [ 0.087733] ------------[ cut here ]------------ > > [ 0.087733] workqueue: work disable count underflowed > > [ 0.087733] WARNING: CPU: 1 PID: 21 at kernel/workqueue.c:4313 enable_work+0xb5/0xc0 > > > > This is: > > > > static void work_offqd_enable(struct work_offq_data *offqd) > > { > > if (likely(offqd->disable > 0)) > > offqd->disable--; > > else > > WARN_ONCE(true, "workqueue: work disable count underflowed\n"); <-- this line > > } > > > > So (based on this code) presumably an enable is only required if previously > > disabled, and this code is being called on startup unconditionally without > > the work having been disabled previously? I'm not hugely familiar with > > delayed workqueue implementation details. > > > > [ 0.087733] Modules linked in: > > [ 0.087733] CPU: 1 UID: 0 PID: 21 Comm: cpuhp/1 Not tainted 6.13.0-rc4+ #58 > > [ 0.087733] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 > > [ 0.087733] RIP: 0010:enable_work+0xb5/0xc0 > > [ 0.087733] Code: 6f b8 01 00 74 0f 31 d2 be 01 00 00 00 eb b5 90 0f 0b 90 eb ca c6 05 60 6f b8 01 01 90 48 c7 c7 b0 a9 6e 82 e8 4c a4 fd ff 90 <0f> 0b 90 90 eb d6 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 > > [ 0.087733] RSP: 0018:ffffc900000cbe30 EFLAGS: 00010092 > > [ 0.087733] RAX: 0000000000000029 RBX: ffff888263ca9d60 RCX: 0000000000000000 > > [ 0.087733] RDX: 0000000000000001 RSI: ffffc900000cbce8 RDI: 0000000000000001 > > [ 0.087733] RBP: ffffc900000cbe30 R08: 00000000ffffdfff R09: ffffffff82b12f08 > > [ 0.087733] R10: 0000000000000003 R11: 0000000000000002 R12: 00000000000000c4 > > [ 0.087733] R13: ffffffff81278d90 R14: 0000000000000000 R15: ffff888263c9c648 > > [ 0.087733] FS: 0000000000000000(0000) GS:ffff888263c80000(0000) knlGS:0000000000000000 > > [ 0.087733] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 0.087733] CR2: 0000000000000000 CR3: 0000000002a2e000 CR4: 0000000000750ef0 > > [ 0.087733] PKRU: 55555554 > > [ 0.087733] Call Trace: > > [ 0.087733] <TASK> > > [ 0.087733] ? enable_work+0xb5/0xc0 > > [ 0.087733] ? __warn.cold+0x93/0xf2 > > [ 0.087733] ? enable_work+0xb5/0xc0 > > [ 0.087733] ? report_bug+0xff/0x140 > > [ 0.087733] ? handle_bug+0x54/0x90 > > [ 0.087733] ? exc_invalid_op+0x17/0x70 > > [ 0.087733] ? asm_exc_invalid_op+0x1a/0x20 > > [ 0.087733] ? __pfx_vmstat_cpu_online+0x10/0x10 > > [ 0.087733] ? enable_work+0xb5/0xc0 > > [ 0.087733] vmstat_cpu_online+0x5c/0x70 > > [ 0.087733] cpuhp_invoke_callback+0x133/0x440 > > [ 0.087733] cpuhp_thread_fun+0x95/0x150 > > [ 0.087733] smpboot_thread_fn+0xd5/0x1d0 > > [ 0.087734] ? __pfx_smpboot_thread_fn+0x10/0x10 > > [ 0.087735] kthread+0xc8/0xf0 > > [ 0.087737] ? __pfx_kthread+0x10/0x10 > > [ 0.087738] ret_from_fork+0x2c/0x50 > > [ 0.087739] ? __pfx_kthread+0x10/0x10 > > [ 0.087740] ret_from_fork_asm+0x1a/0x30 > > [ 0.087742] </TASK> > > [ 0.087742] ---[ end trace 0000000000000000 ]--- > > > > > > > --- > > > v1: https://lore.kernel.org/all/20241220134234.3809621-1-koichiro.den@xxxxxxxxxxxxx/ > > > --- > > > mm/vmstat.c | 3 ++- > > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > > > diff --git a/mm/vmstat.c b/mm/vmstat.c > > > index 4d016314a56c..0889b75cef14 100644 > > > --- a/mm/vmstat.c > > > +++ b/mm/vmstat.c > > > @@ -2148,13 +2148,14 @@ static int vmstat_cpu_online(unsigned int cpu) > > > if (!node_state(cpu_to_node(cpu), N_CPU)) { > > > node_set_state(cpu_to_node(cpu), N_CPU); > > > } > > > + enable_delayed_work(&per_cpu(vmstat_work, cpu)); > > > > Probably needs to be 'if disabled' here, as this is invoked on normal > > startup when the work won't have been disabled? > > > > Had a brief look at code and couldn't see how that could be done > > however... and one would need to be careful about races... Tricky! > > > > > > > > return 0; > > > } > > > > > > static int vmstat_cpu_down_prep(unsigned int cpu) > > > { > > > - cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu)); > > > + disable_delayed_work_sync(&per_cpu(vmstat_work, cpu)); > > > return 0; > > > } > > > > > > -- > > > 2.43.0 > > > > > > > > > > Let me know if you need any more details, .config etc. > > > > I noticed this warning on a real box too (in both cases running akpm's > > mm-unstable branch), FWIW. > > Thank you for the report. I was able to reproduce the warning and now > wonder how I missed it.. My oversight, apologies. > > In my current view, the simplest solution would be to make sure a local > vmstat_work is disabled until vmstat_cpu_online() runs for the cpu, even > during boot-up. The following patch suppresses the warning: > > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 0889b75cef14..19ceed5d34bf 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -2122,10 +2122,14 @@ static void __init start_shepherd_timer(void) > { > int cpu; > > - for_each_possible_cpu(cpu) > + for_each_possible_cpu(cpu) { > INIT_DEFERRABLE_WORK(per_cpu_ptr(&vmstat_work, cpu), > vmstat_update); > > + /* will be enabled on vmstat_cpu_online */ > + disable_delayed_work_sync(&per_cpu(vmstat_work, cpu)); > + } > + > schedule_delayed_work(&shepherd, > round_jiffies_relative(sysctl_stat_interval)); > } > > If you think of a better solution later, please let me know. Otherwise, > I'll submit a follow-up fix patch with the above diff. Thanks, this resolves the problem, but are we sure that _all_ CPUs will definitely call vmstat_cpu_online()? I did a bit of printk output and it seems like this _didn't_ online CPU 0, presumably the boot CPU which calls this function in the first instance? I also see that init_mm_internals() invokes cpuhp_setup_state_nocalls() explicitly which does _not_ call the callback, though even if it did this would be too early as it calls start_shepherd_timer() _after_ this anyway. So yeah, unless I'm missing something, I think this patch is broken. I have added Thomas and Peter to give some insight on the CPU hotplug side. It feels like the patch really needs an 'enable if not already enabled' call in vmstat_cpu_online(). > > Thanks. > > -Koichiro