On Tue, Jan 02, 2024 at 07:46:32PM +0100, Uladzislau Rezki (Sony) wrote:
> A number of nodes which are used in the alloc/free paths is
> set based on num_possible_cpus() in a system. Please note a
> high limit threshold though is fixed and corresponds to 128
> nodes.

Large CPU count machines are NUMA machines. All of the allocation and
reclaim is NUMA node based, i.e. a pgdat per NUMA node. Shrinkers can
also run in a NUMA-aware mode so that per-node structures can be
reclaimed in the same way that per-node LRU lists are scanned for
reclaim.

Hence I'm left to wonder if it would be better to have a vmalloc area
per pgdat (or sub-node cluster) rather than basing the number on CPU
count and then imposing an arbitrary maximum once we get to 128 CPU
cores. We can have 128 CPU cores in a single socket these days, so not
being able to scale the vmalloc areas beyond a single socket seems
like a bit of a limitation.

Scaling out the vmalloc areas in a NUMA-aware fashion allows the
shrinker to run in NUMA-aware mode, which gets rid of the need for the
global shrinker to loop over every single vmap area on every shrinker
invocation. Only the vm areas on the node that has a memory shortage
need to be scanned and reclaimed; it doesn't need to reclaim
everything globally when a single node runs out of memory.

Yes, this may not give quite as good microbenchmark scalability
results, but being able to locate each vm area in node-local memory,
and have operations on them and vmalloc area reclaim largely isolated
to node-local tasks, will work much better on large multi-socket NUMA
machines.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx