On Thu, Aug 08, 2024 at 10:20:06PM -0700, Andrew Morton wrote:
> On Fri, 5 Jul 2024 01:48:21 -0700 Saurabh Sengar <ssengar@xxxxxxxxxxxxxxxxxxx> wrote:
>
> > refresh_zone_stat_thresholds function has two loops which is expensive for
> > higher number of CPUs and NUMA nodes.
> >
> > Below is the rough estimation of total iterations done by these loops
> > based on number of NUMA and CPUs.
> >
> > Total number of iterations: nCPU * 2 * Numa * mCPU
> > Where:
> >     nCPU = total number of CPUs
> >     Numa = total number of NUMA nodes
> >     mCPU = mean value of total CPUs (e.g., 512 for 1024 total CPUs)
> >
> > For the system under test with 16 NUMA nodes and 1024 CPUs, this
> > results in a substantial increase in the number of loop iterations
> > during boot-up when NUMA is enabled:
> >
> > No NUMA = 1024*2*1*512 = 1,048,576 : Here refresh_zone_stat_thresholds
> > takes around 224 ms total for all the CPUs in the system under test.
> > 16 NUMA = 1024*2*16*512 = 16,777,216 : Here refresh_zone_stat_thresholds
> > takes around 4.5 seconds total for all the CPUs in the system under test.
> >
> > Calling this for each CPU is expensive when there are large number
> > of CPUs along with multiple NUMAs. Fix this by deferring
> > refresh_zone_stat_thresholds to be called later at once when all the
> > secondary CPUs are up. Also, register the DYN hooks to keep the
> > existing hotplug functionality intact.
> >
> > ...
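As a sanity check on the estimate quoted above, the arithmetic can be reproduced with a few lines of userspace C. This is only a sketch: estimate_iterations() and check_quoted_figures() are hypothetical helper names, mCPU is assumed to be nCPU / 2 as in the "512 for 1024 total CPUs" example, and none of this is kernel code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Rough iteration estimate from the commit message:
 *   total = nCPU * 2 * Numa * mCPU
 * mCPU is assumed to be nCPU / 2 (512 for 1024 CPUs).
 * Userspace arithmetic only, not kernel code.
 */
static uint64_t estimate_iterations(uint64_t ncpu, uint64_t numa_nodes)
{
	uint64_t mcpu = ncpu / 2;	/* mean CPU count used by the estimate */

	return ncpu * 2 * numa_nodes * mcpu;
}

/* Reproduces the two figures quoted in the commit message. */
static void check_quoted_figures(void)
{
	assert(estimate_iterations(1024, 1) == 1048576);	/* no NUMA */
	assert(estimate_iterations(1024, 16) == 16777216);	/* 16 NUMA nodes */
}
```

Going from 1 to 16 nodes multiplies the estimate by 16, while the measured time goes from about 224 ms to about 4.5 seconds, so the estimate tracks the measurement only roughly.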
> >
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -31,6 +31,7 @@
> >
> >  #include "internal.h"
> >
> > +static int vmstat_late_init_done;
> >  #ifdef CONFIG_NUMA
> >  int sysctl_vm_numa_stat = ENABLE_NUMA_STAT;
> >
> > @@ -2107,7 +2108,8 @@ static void __init init_cpu_node_state(void)
> >
> >  static int vmstat_cpu_online(unsigned int cpu)
> >  {
> > -	refresh_zone_stat_thresholds();
> > +	if (vmstat_late_init_done)
> > +		refresh_zone_stat_thresholds();
> >
> >  	if (!node_state(cpu_to_node(cpu), N_CPU)) {
> >  		node_set_state(cpu_to_node(cpu), N_CPU);
> > @@ -2139,6 +2141,14 @@ static int vmstat_cpu_dead(unsigned int cpu)
> >  	return 0;
> >  }
> >
> > +static int __init vmstat_late_init(void)
> > +{
> > +	refresh_zone_stat_thresholds();
> > +	vmstat_late_init_done = 1;
> > +
> > +	return 0;
> > +}
> > +late_initcall(vmstat_late_init);
>
> OK, so what's happening here. Once all CPUs are online and running
> around doing heaven knows what, one of the CPUs sets up everyone's
> thresholds. So for a period, all the other CPUs are running with
> inappropriate threshold values.
>
> So what are all the other CPUs doing at this point in time, and why is
> it safe to leave their thresholds in an inappropriate state while they
> are doing it?
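To make the window being asked about concrete, here is a small userspace model of the control flow in the patch. It is a sketch only: every model_* name is hypothetical, the refresh is reduced to a counter, and it says nothing about whether running with the earlier thresholds is safe. It only shows when the refresh would run under the patched logic.

```c
#include <assert.h>

/* Hypothetical userspace model of the patched flow; not kernel code. */
struct model_state {
	int late_init_done;	/* stands in for vmstat_late_init_done */
	int refresh_calls;	/* how many times the expensive refresh ran */
};

/* Stand-in for refresh_zone_stat_thresholds(): just count the call. */
static void model_refresh(struct model_state *s)
{
	s->refresh_calls++;
}

/* Mirrors the patched vmstat_cpu_online(): no refresh before late init. */
static void model_cpu_online(struct model_state *s)
{
	if (s->late_init_done)
		model_refresh(s);
}

/* Mirrors vmstat_late_init(): one refresh, then allow per-CPU refreshes. */
static void model_late_init(struct model_state *s)
{
	model_refresh(s);
	s->late_init_done = 1;
}

/* Bring up ncpu CPUs, run late init, then hotplug one more CPU. */
static int model_boot_and_hotplug(int ncpu)
{
	struct model_state s = { 0, 0 };
	int i;

	for (i = 0; i < ncpu; i++)
		model_cpu_online(&s);	/* skipped: late init has not run yet */
	model_late_init(&s);		/* the single deferred refresh */
	model_cpu_online(&s);		/* post-boot hotplug refreshes as before */

	return s.refresh_calls;
}
```

In this model, model_boot_and_hotplug(1024) performs two refreshes in total: the deferred one in late init and the one for the CPU hotplugged afterwards. The window in question is everything between the first model_cpu_online() call and model_late_init().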