On 8/12/24 11:43, Saurabh Sengar wrote:
> refresh_zone_stat_thresholds function has two loops which is expensive for
> higher number of CPUs and NUMA nodes.
>
> Below is the rough estimation of total iterations done by these loops
> based on number of NUMA and CPUs.
>
> Total number of iterations: nCPU * 2 * Numa * mCPU
> Where:
> nCPU = total number of CPUs
> Numa = total number of NUMA nodes
> mCPU = mean value of total CPUs (e.g., 512 for 1024 total CPUs)
>
> For the system under test with 16 NUMA nodes and 1024 CPUs, this
> results in a substantial increase in the number of loop iterations
> during boot-up when NUMA is enabled:
>
> No NUMA = 1024*2*1*512 = 1,048,576 : Here refresh_zone_stat_thresholds
> takes around 224 ms total for all the CPUs in the system under test.
> 16 NUMA = 1024*2*16*512 = 16,777,216 : Here refresh_zone_stat_thresholds
> takes around 4.5 seconds total for all the CPUs in the system under test.
>
> Calling this for each CPU is expensive when there are large number
> of CPUs along with multiple NUMAs. Fix this by deferring
> refresh_zone_stat_thresholds to be called later at once when all the
> secondary CPUs are up. Also, register the DYN hooks to keep the
> existing hotplug functionality intact.
>
> Signed-off-by: Saurabh Sengar <ssengar@xxxxxxxxxxxxxxxxxxx>
> ---
> [V2]
> - Move vmstat_late_init_done under CONFIG_SMP to fix
> variable 'defined but not used' warning.
>
>  mm/vmstat.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4e2dc067a654..fa235c65c756 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1908,6 +1908,7 @@ static const struct seq_operations vmstat_op = {
>  #ifdef CONFIG_SMP
>  static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
>  int sysctl_stat_interval __read_mostly = HZ;
> +static int vmstat_late_init_done;
>
>  #ifdef CONFIG_PROC_FS
>  static void refresh_vm_stats(struct work_struct *work)
> @@ -2110,7 +2111,8 @@ static void __init init_cpu_node_state(void)
>
>  static int vmstat_cpu_online(unsigned int cpu)
>  {
> -	refresh_zone_stat_thresholds();
> +	if (vmstat_late_init_done)
> +		refresh_zone_stat_thresholds();
>
>  	if (!node_state(cpu_to_node(cpu), N_CPU)) {
>  		node_set_state(cpu_to_node(cpu), N_CPU);
> @@ -2142,6 +2144,14 @@ static int vmstat_cpu_dead(unsigned int cpu)
>  	return 0;
>  }
>
> +static int __init vmstat_late_init(void)
> +{
> +	refresh_zone_stat_thresholds();
> +	vmstat_late_init_done = 1;
> +
> +	return 0;
> +}
> +late_initcall(vmstat_late_init);
>  #endif
>
>  struct workqueue_struct *mm_percpu_wq;

Is the late_initcall()-triggered vmstat_late_init() guaranteed to be
called before the last call into vmstat_cpu_online() during a normal boot?

Otherwise, refresh_zone_stat_thresholds() will never be called unless there
is a CPU online event later.