Re: [PATCH] mm/vmstat: Defer the refresh_zone_stat_thresholds after all CPUs bringup

Saurabh Singh Sengar <ssengar@xxxxxxxxxxxxxxxxxxx> · Thu, 8 Aug 2024 22:49:56 -0700



On Thu, Aug 08, 2024 at 10:20:06PM -0700, Andrew Morton wrote:
> On Fri,  5 Jul 2024 01:48:21 -0700 Saurabh Sengar <ssengar@xxxxxxxxxxxxxxxxxxx> wrote:
> 
> > The refresh_zone_stat_thresholds() function has two loops, which are
> > expensive for a large number of CPUs and NUMA nodes.
> > 
> > Below is a rough estimate of the total iterations done by these loops,
> > based on the number of NUMA nodes and CPUs.
> > 
> > Total number of iterations: nCPU * 2 * Numa * mCPU
> > Where:
> >  nCPU = total number of CPUs
> >  Numa = total number of NUMA nodes
> >  mCPU = mean value of total CPUs (e.g., 512 for 1024 total CPUs)
> > 
> > For the system under test with 16 NUMA nodes and 1024 CPUs, this
> > results in a substantial increase in the number of loop iterations
> > during boot-up when NUMA is enabled:
> > 
> > No NUMA = 1024*2*1*512  =   1,048,576 : Here refresh_zone_stat_thresholds
> > takes around 224 ms total for all the CPUs in the system under test.
> > 16 NUMA = 1024*2*16*512 =  16,777,216 : Here refresh_zone_stat_thresholds
> > takes around 4.5 seconds total for all the CPUs in the system under test.
> > 
> > Calling this for each CPU is expensive when there are a large number
> > of CPUs along with multiple NUMA nodes. Fix this by deferring
> > refresh_zone_stat_thresholds() so that it is called once, later, when
> > all the secondary CPUs are up. Also, register the DYN hooks to keep the
> > existing hotplug functionality intact.
> >
> > ...
> >
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -31,6 +31,7 @@
> >  
> >  #include "internal.h"
> >  
> > +static int vmstat_late_init_done;
> >  #ifdef CONFIG_NUMA
> >  int sysctl_vm_numa_stat = ENABLE_NUMA_STAT;
> >  
> > @@ -2107,7 +2108,8 @@ static void __init init_cpu_node_state(void)
> >  
> >  static int vmstat_cpu_online(unsigned int cpu)
> >  {
> > -	refresh_zone_stat_thresholds();
> > +	if (vmstat_late_init_done)
> > +		refresh_zone_stat_thresholds();
> >  
> >  	if (!node_state(cpu_to_node(cpu), N_CPU)) {
> >  		node_set_state(cpu_to_node(cpu), N_CPU);
> > @@ -2139,6 +2141,14 @@ static int vmstat_cpu_dead(unsigned int cpu)
> >  	return 0;
> >  }
> >  
> > +static int __init vmstat_late_init(void)
> > +{
> > +	refresh_zone_stat_thresholds();
> > +	vmstat_late_init_done = 1;
> > +
> > +	return 0;
> > +}
> > +late_initcall(vmstat_late_init);
> 
> OK, so what's happening here.  Once all CPUs are online and running
> around doing heaven knows what, one of the CPUs sets up everyone's
> thresholds.  So for a period, all the other CPUs are running with
> inappropriate threshold values.
> 
> So what are all the other CPUs doing at this point in time, and why is
> it safe to leave their thresholds in an inappropriate state while they
> are doing it?
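
For reference, the iteration estimate quoted from the patch description can
be reproduced with a small userspace helper. This is only a sketch of the
arithmetic, not kernel code: estimate_refresh_iterations() is a hypothetical
name, and it assumes mCPU is simply nCPU / 2, as in the "512 for 1024 total
CPUs" example.

```c
#include <stdint.h>

/*
 * Rough iteration estimate from the patch description:
 *
 *   total = nCPU * 2 * Numa * mCPU
 *
 * estimate_refresh_iterations() is a hypothetical userspace helper used
 * only to check the arithmetic; it does not exist in mm/vmstat.c.
 */
static uint64_t estimate_refresh_iterations(uint64_t ncpu, uint64_t nnodes)
{
	/* Assumption: mCPU is half the total, e.g. 512 for 1024 CPUs. */
	uint64_t mean_cpus = ncpu / 2;

	return ncpu * 2 * nnodes * mean_cpus;
}
```

Calling it with (1024, 1) and (1024, 16) gives the 1,048,576 and 16,777,216
figures from the description, so under this estimate the work grows linearly
with the node count and quadratically with the CPU count.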