On Mon, Jan 06, 2025 at 10:03:57PM +0900, Koichiro Den wrote:
> On Mon, Jan 06, 2025 at 10:52:37AM +0000, Lorenzo Stoakes wrote:
> > +cc tglx, peterz for insight on CPU hot plug
> >
> > On Sat, Jan 04, 2025 at 01:00:17PM +0900, Koichiro Den wrote:
> > > On Fri, Jan 03, 2025 at 11:33:19PM +0000, Lorenzo Stoakes wrote:
> > > > On Sat, Dec 21, 2024 at 12:33:20PM +0900, Koichiro Den wrote:
> > > > > Even after mm/vmstat:online teardown, shepherd may still queue work for
> > > > > the dying cpu until the cpu is removed from online mask. While it's
> > > > > quite rare, this means that after unbind_workers() unbinds a per-cpu
> > > > > kworker, it potentially runs vmstat_update for the dying CPU on an
> > > > > irrelevant cpu before entering atomic AP states.
> > > > > When CONFIG_DEBUG_PREEMPT=y, it results in the following error with the
> > > > > backtrace.
> > > > >
> > > > > BUG: using smp_processor_id() in preemptible [00000000] code: \
> > > > > kworker/7:3/1702
> > > > > caller is refresh_cpu_vm_stats+0x235/0x5f0
> > > > > CPU: 0 UID: 0 PID: 1702 Comm: kworker/7:3 Tainted: G
> > > > > Tainted: [N]=TEST
> > > > > Workqueue: mm_percpu_wq vmstat_update
> > > > > Call Trace:
> > > > >  <TASK>
> > > > >  dump_stack_lvl+0x8d/0xb0
> > > > >  check_preemption_disabled+0xce/0xe0
> > > > >  refresh_cpu_vm_stats+0x235/0x5f0
> > > > >  vmstat_update+0x17/0xa0
> > > > >  process_one_work+0x869/0x1aa0
> > > > >  worker_thread+0x5e5/0x1100
> > > > >  kthread+0x29e/0x380
> > > > >  ret_from_fork+0x2d/0x70
> > > > >  ret_from_fork_asm+0x1a/0x30
> > > > >  </TASK>
> > > > >
> > > > > So, for mm/vmstat:online, disable vmstat_work reliably on teardown and
> > > > > symmetrically enable it on startup.
> > > > >
> > > > > Signed-off-by: Koichiro Den <koichiro.den@xxxxxxxxxxxxx>
> > > >
> > > > Hi,
> > > >
> > > > I observed a warning in my qemu and real hardware, which I bisected to
> > > > this commit:
> > > >
> > > > [ 0.087733] ------------[ cut here ]------------
> > > > [ 0.087733] workqueue: work disable count underflowed
> > > > [ 0.087733] WARNING: CPU: 1 PID: 21 at kernel/workqueue.c:4313 enable_work+0xb5/0xc0
> > > >
> > > > This is:
> > > >
> > > > static void work_offqd_enable(struct work_offq_data *offqd)
> > > > {
> > > > 	if (likely(offqd->disable > 0))
> > > > 		offqd->disable--;
> > > > 	else
> > > > 		WARN_ONCE(true, "workqueue: work disable count underflowed\n"); <-- this line
> > > > }
> > > >
> > > > So (based on this code) presumably an enable is only required if previously
> > > > disabled, and this code is being called on startup unconditionally without
> > > > the work having been disabled previously? I'm not hugely familiar with
> > > > delayed workqueue implementation details.
> > > >
> > > > [ 0.087733] Modules linked in:
> > > > [ 0.087733] CPU: 1 UID: 0 PID: 21 Comm: cpuhp/1 Not tainted 6.13.0-rc4+ #58
> > > > [ 0.087733] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
> > > > [ 0.087733] RIP: 0010:enable_work+0xb5/0xc0
> > > > [ 0.087733] Code: 6f b8 01 00 74 0f 31 d2 be 01 00 00 00 eb b5 90 0f 0b 90 eb ca c6 05 60 6f b8 01 01 90 48 c7 c7 b0 a9 6e 82 e8 4c a4 fd ff 90 <0f> 0b 90 90 eb d6 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90
> > > > [ 0.087733] RSP: 0018:ffffc900000cbe30 EFLAGS: 00010092
> > > > [ 0.087733] RAX: 0000000000000029 RBX: ffff888263ca9d60 RCX: 0000000000000000
> > > > [ 0.087733] RDX: 0000000000000001 RSI: ffffc900000cbce8 RDI: 0000000000000001
> > > > [ 0.087733] RBP: ffffc900000cbe30 R08: 00000000ffffdfff R09: ffffffff82b12f08
> > > > [ 0.087733] R10: 0000000000000003 R11: 0000000000000002 R12: 00000000000000c4
> > > > [ 0.087733] R13: ffffffff81278d90 R14: 0000000000000000 R15: ffff888263c9c648
> > > > [ 0.087733] FS:  0000000000000000(0000) GS:ffff888263c80000(0000) knlGS:0000000000000000
> > > > [ 0.087733] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [ 0.087733] CR2: 0000000000000000 CR3: 0000000002a2e000 CR4: 0000000000750ef0
> > > > [ 0.087733] PKRU: 55555554
> > > > [ 0.087733] Call Trace:
> > > > [ 0.087733]  <TASK>
> > > > [ 0.087733]  ? enable_work+0xb5/0xc0
> > > > [ 0.087733]  ? __warn.cold+0x93/0xf2
> > > > [ 0.087733]  ? enable_work+0xb5/0xc0
> > > > [ 0.087733]  ? report_bug+0xff/0x140
> > > > [ 0.087733]  ? handle_bug+0x54/0x90
> > > > [ 0.087733]  ? exc_invalid_op+0x17/0x70
> > > > [ 0.087733]  ? asm_exc_invalid_op+0x1a/0x20
> > > > [ 0.087733]  ? __pfx_vmstat_cpu_online+0x10/0x10
> > > > [ 0.087733]  ? enable_work+0xb5/0xc0
> > > > [ 0.087733]  vmstat_cpu_online+0x5c/0x70
> > > > [ 0.087733]  cpuhp_invoke_callback+0x133/0x440
> > > > [ 0.087733]  cpuhp_thread_fun+0x95/0x150
> > > > [ 0.087733]  smpboot_thread_fn+0xd5/0x1d0
> > > > [ 0.087734]  ? __pfx_smpboot_thread_fn+0x10/0x10
> > > > [ 0.087735]  kthread+0xc8/0xf0
> > > > [ 0.087737]  ? __pfx_kthread+0x10/0x10
> > > > [ 0.087738]  ret_from_fork+0x2c/0x50
> > > > [ 0.087739]  ? __pfx_kthread+0x10/0x10
> > > > [ 0.087740]  ret_from_fork_asm+0x1a/0x30
> > > > [ 0.087742]  </TASK>
> > > > [ 0.087742] ---[ end trace 0000000000000000 ]---
> > > >
> > > >
> > > > > ---
> > > > > v1: https://lore.kernel.org/all/20241220134234.3809621-1-koichiro.den@xxxxxxxxxxxxx/
> > > > > ---
> > > > >  mm/vmstat.c | 3 ++-
> > > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > > > > index 4d016314a56c..0889b75cef14 100644
> > > > > --- a/mm/vmstat.c
> > > > > +++ b/mm/vmstat.c
> > > > > @@ -2148,13 +2148,14 @@ static int vmstat_cpu_online(unsigned int cpu)
> > > > >  	if (!node_state(cpu_to_node(cpu), N_CPU)) {
> > > > >  		node_set_state(cpu_to_node(cpu), N_CPU);
> > > > >  	}
> > > > > +	enable_delayed_work(&per_cpu(vmstat_work, cpu));
> > > >
> > > > Probably needs to be 'if disabled' here, as this is invoked on normal
> > > > startup when the work won't have been disabled?
> > > >
> > > > Had a brief look at code and couldn't see how that could be done
> > > > however... and one would need to be careful about races... Tricky!
> > > >
> > > > >
> > > > >  	return 0;
> > > > >  }
> > > > >
> > > > >  static int vmstat_cpu_down_prep(unsigned int cpu)
> > > > >  {
> > > > > -	cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
> > > > > +	disable_delayed_work_sync(&per_cpu(vmstat_work, cpu));
> > > > >  	return 0;
> > > > >  }
> > > > >
> > > > > --
> > > > > 2.43.0
> > > > >
> > > >
> > > > Let me know if you need any more details, .config etc.
> > > >
> > > > I noticed this warning on a real box too (in both cases running akpm's
> > > > mm-unstable branch), FWIW.
> > >
> > > Thank you for the report. I was able to reproduce the warning and now
> > > wonder how I missed it.. My oversight, apologies.
> > >
> > > In my current view, the simplest solution would be to make sure a local
> > > vmstat_work is disabled until vmstat_cpu_online() runs for the cpu, even
> > > during boot-up. The following patch suppresses the warning:
> > >
> > > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > > index 0889b75cef14..19ceed5d34bf 100644
> > > --- a/mm/vmstat.c
> > > +++ b/mm/vmstat.c
> > > @@ -2122,10 +2122,14 @@ static void __init start_shepherd_timer(void)
> > >  {
> > >  	int cpu;
> > >
> > > -	for_each_possible_cpu(cpu)
> > > +	for_each_possible_cpu(cpu) {
> > >  		INIT_DEFERRABLE_WORK(per_cpu_ptr(&vmstat_work, cpu),
> > >  			vmstat_update);
> > >
> > > +		/* will be enabled on vmstat_cpu_online */
> > > +		disable_delayed_work_sync(&per_cpu(vmstat_work, cpu));
> > > +	}
> > > +
> > >  	schedule_delayed_work(&shepherd,
> > >  		round_jiffies_relative(sysctl_stat_interval));
> > >  }
> > >
> > > If you think of a better solution later, please let me know. Otherwise,
> > > I'll submit a follow-up fix patch with the above diff.
> >
> > Thanks, this resolves the problem, but are we sure that _all_ CPUs will
> > definitely call vmstat_cpu_online()?
> >
> > I did a bit of printk output and it seems like this _didn't_ online CPU 0,
> > presumably the boot CPU which calls this function in the first instance?
> >
> > I also see that init_mm_internals() invokes cpuhp_setup_state_nocalls()
> > explicitly which does _not_ call the callback, though even if it did this
> > would be too early as it calls start_shepherd_timer() _after_ this anyway.
> >
> > So yeah, unless I'm missing something, I think this patch is broken.
>
> You're absolutely right, thanks a lot.
> I also appreciate you testing it, thanks everyone.
>
> >
> > I have added Thomas and Peter to give some insight on the CPU hotplug side.
> >
> > It feels like the patch really needs an 'enable if not already enabled'
> > call in vmstat_cpu_online().
>
> Right.
> While not fully polished yet, I've tested the following diff:
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 0889b75cef14..f967aa22392f 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1937,7 +1937,11 @@ static const struct seq_operations vmstat_op = {
>  #endif /* CONFIG_PROC_FS */
>
>  #ifdef CONFIG_SMP
> -static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
> +struct vmstat_work {
> +	struct delayed_work dwork;
> +	bool enabled;
> +};
> +static DEFINE_PER_CPU(struct vmstat_work, vmstat_work);
>  int sysctl_stat_interval __read_mostly = HZ;
>  static int vmstat_late_init_done;
>
> @@ -2015,7 +2019,7 @@ static void vmstat_update(struct work_struct *w)
>  		 * update worker thread.
>  		 */
>  		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
> -			this_cpu_ptr(&vmstat_work),
> +			&this_cpu_ptr(&vmstat_work)->dwork,
>  			round_jiffies_relative(sysctl_stat_interval));
>  	}
>  }
>
> @@ -2059,7 +2063,7 @@ void quiet_vmstat(void)
>  	if (system_state != SYSTEM_RUNNING)
>  		return;
>
> -	if (!delayed_work_pending(this_cpu_ptr(&vmstat_work)))
> +	if (!delayed_work_pending(&this_cpu_ptr(&vmstat_work)->dwork))
>  		return;
>
>  	if (!need_update(smp_processor_id()))
>
> @@ -2091,7 +2095,7 @@ static void vmstat_shepherd(struct work_struct *w)
>  	cpus_read_lock();
>  	/* Check processors whose vmstat worker threads have been disabled */
>  	for_each_online_cpu(cpu) {
> -		struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
> +		struct delayed_work *dw = &per_cpu(vmstat_work, cpu).dwork;
>
>  		/*
>  		 * In kernel users of vmstat counters either require the precise value and
>
> @@ -2120,11 +2124,14 @@ static void vmstat_shepherd(struct work_struct *w)
>
>  static void __init start_shepherd_timer(void)
>  {
> +	struct vmstat_work *vmstat;
>  	int cpu;
>
> -	for_each_possible_cpu(cpu)
> -		INIT_DEFERRABLE_WORK(per_cpu_ptr(&vmstat_work, cpu),
> -			vmstat_update);
> +	for_each_possible_cpu(cpu) {
> +		vmstat = per_cpu_ptr(&vmstat_work, cpu);
> +		INIT_DEFERRABLE_WORK(&vmstat->dwork, vmstat_update);
> +		vmstat->enabled = true;
> +	}
>
>  	schedule_delayed_work(&shepherd,
>  		round_jiffies_relative(sysctl_stat_interval));
>
> @@ -2142,20 +2149,28 @@ static void __init init_cpu_node_state(void)
>
>  static int vmstat_cpu_online(unsigned int cpu)
>  {
> +	struct vmstat_work *vmstat = per_cpu_ptr(&vmstat_work, cpu);
> +
>  	if (vmstat_late_init_done)
>  		refresh_zone_stat_thresholds();
>
>  	if (!node_state(cpu_to_node(cpu), N_CPU)) {
>  		node_set_state(cpu_to_node(cpu), N_CPU);
>  	}
> -	enable_delayed_work(&per_cpu(vmstat_work, cpu));
> +
> +	if (!vmstat->enabled) {
> +		enable_delayed_work(&vmstat->dwork);
> +		vmstat->enabled = true;
> +	}
>
>  	return 0;
>  }
>
>  static int vmstat_cpu_down_prep(unsigned int cpu)
>  {
> -	disable_delayed_work_sync(&per_cpu(vmstat_work, cpu));
> +	struct vmstat_work *vmstat = per_cpu_ptr(&vmstat_work, cpu);
> +
> +	disable_delayed_work_sync(&vmstat->dwork);

The line `vmstat->enabled = false;` was missing here. With that fixed, I've
conducted the following tests:

* Boot-up check: confirmed that vmstat_update() runs on all cpus after
  boot-up.

* Cpuhp transitions: verified resilience during random CPU hotplug
  transitions using the steps below:

  1. enabled dynamic ftrace on vmstat_update()
  2. ran the following simple silly script for a while

     #!/bin/bash
     while :; do
         now=$(cat /sys/devices/system/cpu/cpu7/hotplug/state)
         next=$(shuf --random-source=/dev/urandom -i 0-238 -n 1)
         echo "${now} -> ${next}"
         echo $next > /sys/devices/system/cpu/cpu7/hotplug/target 2>/dev/null
         sleep 0.001
     done

  3. after stopping the script, ensured that the cpu was fully online again
     by writing the largest value to the 'state' file.
  4. verified that vmstat_update() was working on the cpu.

Let me know if you have any suggestions or concerns.

Thanks.

-Koichiro Den

>  	return 0;
>  }
>
> Lorenzo, and Thomas and Peter if you're available, I'd greatly appreciate
> any thoughts or feedback on this.
>
> Thanks.
>
> -Koichiro Den
>
> >
> >
> > > Thanks.
> > >
> > > -Koichiro