Re: [PATCH v3 10/11] mm: vmalloc: Set nr_nodes based on CPUs in a system

Uladzislau Rezki <urezki@xxxxxxxxx> · Mon, 15 Jan 2024 20:09:29 +0100

> On Tue, Jan 02, 2024 at 07:46:32PM +0100, Uladzislau Rezki (Sony) wrote:
> > A number of nodes which are used in the alloc/free paths is
> > set based on num_possible_cpus() in a system. Please note a
> > high limit threshold though is fixed and corresponds to 128
> > nodes.
> 
> Large CPU count machines are NUMA machines. ALl of the allocation
> and reclaim is NUMA node based i.e. a pgdat per NUMA node.
> 
> Shrinkers are also able to be run in a NUMA aware mode so that
> per-node structures can be reclaimed similar to how per-node LRU
> lists are scanned for reclaim.
> 
> Hence I'm left to wonder if it would be better to have a vmalloc
> area per pgdat (or sub-node cluster) rather than just base the
> number on CPU count and then have an arbitrary maximum number when
> we get to 128 CPU cores. We can have 128 CPU cores in a
> single socket these days, so not being able to scale the vmalloc
> areas beyond a single socket seems like a bit of a limitation.
> 
>
> Hence I'm left to wonder if it would be better to have a vmalloc
> area per pgdat (or sub-node cluster) rather than just base the
>
> Scaling out the vmalloc areas in a NUMA aware fashion allows the
> shrinker to be run in numa aware mode, which gets rid of the need
> for the global shrinker to loop over every single vmap area in every
> shrinker invocation. Only the vm areas on the node that has a memory
> shortage need to be scanned and reclaimed, it doesn't need reclaim
> everything globally when a single node runs out of memory.
> 
> Yes, this may not give quite as good microbenchmark scalability
> results, but being able to locate each vm area in node local memory
> and have operation on them largely isolated to node-local tasks and
> vmalloc area reclaim will work much better on large multi-socket
> NUMA machines.
> 
Currently i fix the max nodes number to 128. This is because i do not
have an access to such big NUMA systems whereas i do have an access to
around ~128 ones. That is why i have decided to stop on that number as
of now.

We can easily set nr_nodes to num_possible_cpus() and let it scale for
anyone. But before doing this, i would like to give it a try as a first
step because i have not tested it well on really big NUMA systems.

Thanks for you NUMA-aware input.

--
Uladzislau Rezki