Re: [PATCH] mm/vmstat: Defer the refresh_zone_stat_thresholds after all CPUs bringup

Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> · Fri, 5 Jul 2024 13:59:11 -0700

On Fri,  5 Jul 2024 01:48:21 -0700 Saurabh Sengar <ssengar@xxxxxxxxxxxxxxxxxxx> wrote:

> The refresh_zone_stat_thresholds() function has two loops, which are
> expensive for a large number of CPUs and NUMA nodes.
> 
> Below is a rough estimate of the total iterations done by these loops,
> based on the number of NUMA nodes and CPUs.
> 
> Total number of iterations: nCPU * 2 * Numa * mCPU
> Where:
>  nCPU = total number of CPUs
>  Numa = total number of NUMA nodes
>  mCPU = mean number of online CPUs during bringup (e.g., 512 for 1024 total CPUs)
> 
> For the system under test with 16 NUMA nodes and 1024 CPUs, this
> results in a substantial increase in the number of loop iterations
> during boot-up when NUMA is enabled:
> 
> No NUMA = 1024*2*1*512  =   1,048,576 : Here refresh_zone_stat_thresholds
> takes around 224 ms total for all the CPUs in the system under test.
> 16 NUMA = 1024*2*16*512 =  16,777,216 : Here refresh_zone_stat_thresholds
> takes around 4.5 seconds total for all the CPUs in the system under test.

Did you measure the overall before-and-after times?  IOW, how much of
that 4.5s do we reclaim?

> Calling this for each CPU is expensive when there is a large number
> of CPUs along with multiple NUMA nodes. Fix this by deferring
> refresh_zone_stat_thresholds() so that it is called once, later, when
> all the secondary CPUs are up. Also, register the DYN hooks to keep the
> existing hotplug functionality intact.
> 

Seems risky - we'll now have online CPUs which have uninitialized data,
yes?  What assurance do we have that this data won't be accessed?

Another approach might be to make the code a bit smarter - instead of
calculating thresholds for the whole world, we make incremental changes
to the existing thresholds on behalf of the new resource which just
became available?
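
To make that a bit more concrete, here is a rough userspace sketch,
not kernel code and not tested against a real tree.  It assumes the
normal threshold is computed the way I recall
calculate_normal_threshold() doing it: 2 * fls(num_online_cpus()) *
(1 + fls(mem)), capped at 125, with mem in 128MB units.  If that is
right, the value only moves when the online CPU count crosses a power
of two.  fls_sketch(), normal_threshold(), cpu_online_needs_refresh()
and count_refreshes() are all made-up names for illustration.

```c
#include <assert.h>

#define MAX_THRESHOLD 125	/* assumed cap, as in mm/vmstat.c */

/* Userspace stand-in for the kernel's fls(): 1-based index of the
 * highest set bit, 0 when x == 0. */
static int fls_sketch(unsigned long x)
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/* Modelled on calculate_normal_threshold() (assumption); mem_units is
 * the zone's managed memory expressed in 128MB units. */
static int normal_threshold(unsigned int online_cpus, unsigned long mem_units)
{
	int threshold = 2 * fls_sketch(online_cpus) * (1 + fls_sketch(mem_units));

	return threshold < MAX_THRESHOLD ? threshold : MAX_THRESHOLD;
}

/* Hypothetical helper: does going from old_cpus to old_cpus + 1 online
 * CPUs change the fls() term, and hence possibly the threshold? */
static int cpu_online_needs_refresh(unsigned int old_cpus)
{
	return fls_sketch(old_cpus + 1) != fls_sketch(old_cpus);
}

/* Count how many full refreshes would remain if secondary CPUs are
 * brought up one at a time and the refresh is skipped whenever the
 * fls() term is unchanged.  The boot CPU is assumed to be online. */
static unsigned int count_refreshes(unsigned int ncpus)
{
	unsigned int cpus, refreshes = 0;

	assert(ncpus >= 1);
	for (cpus = 1; cpus < ncpus; cpus++)
		if (cpu_online_needs_refresh(cpus))
			refreshes++;
	return refreshes;
}
```

In this model count_refreshes(1024) comes out at 10, versus 1023
unconditional refreshes.  It glosses over at least two things: the
newly onlined CPU's own per-cpu threshold still has to be set on every
bringup, and I believe percpu_drift_mark scales with num_online_cpus()
directly, so that part would need separate (cheap, per-zone) handling.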