> On Tue, Jan 02, 2024 at 07:46:32PM +0100, Uladzislau Rezki (Sony) wrote:
> > A number of nodes which are used in the alloc/free paths is
> > set based on num_possible_cpus() in a system. Please note a
> > high limit threshold though is fixed and corresponds to 128
> > nodes.
>
> Large CPU count machines are NUMA machines. All of the allocation
> and reclaim is NUMA node based i.e. a pgdat per NUMA node.
>
> Shrinkers are also able to be run in a NUMA aware mode so that
> per-node structures can be reclaimed similar to how per-node LRU
> lists are scanned for reclaim.
>
> Hence I'm left to wonder if it would be better to have a vmalloc
> area per pgdat (or sub-node cluster) rather than just base the
> number on CPU count and then have an arbitrary maximum number when
> we get to 128 CPU cores. We can have 128 CPU cores in a
> single socket these days, so not being able to scale the vmalloc
> areas beyond a single socket seems like a bit of a limitation.
>
> Scaling out the vmalloc areas in a NUMA aware fashion allows the
> shrinker to be run in numa aware mode, which gets rid of the need
> for the global shrinker to loop over every single vmap area in every
> shrinker invocation. Only the vm areas on the node that has a memory
> shortage need to be scanned and reclaimed, it doesn't need to reclaim
> everything globally when a single node runs out of memory.
>
> Yes, this may not give quite as good microbenchmark scalability
> results, but being able to locate each vm area in node local memory
> and have operation on them largely isolated to node-local tasks and
> vmalloc area reclaim will work much better on large multi-socket
> NUMA machines.
>
Currently I fix the maximum number of nodes at 128. This is because I do
not have access to such big NUMA systems, whereas I do have access to
ones with around 128 CPUs.
That is why I have decided to stop at that number for now. We can
easily set nr_nodes to num_possible_cpus() and let it scale for anyone.
But before doing this, I would like to give it a try as a first step,
because I have not tested it well on really big NUMA systems.

Thanks for your NUMA-aware input.

--
Uladzislau Rezki
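
For illustration, here is a minimal userspace sketch of the sizing discussed above, assuming the node count is simply the possible-CPU count clamped to a fixed cap. The names pick_nr_nodes and NR_NODES_CAP are hypothetical and are not taken from the patch. The possible-CPU count is passed in as a plain argument instead of calling num_possible_cpus().

```c
/*
 * Userspace sketch only, not kernel code.
 * NR_NODES_CAP is a hypothetical name for the fixed upper limit (128)
 * mentioned in the thread.
 */
#define NR_NODES_CAP 128u

/*
 * Hypothetical helper: derive the number of nodes from the number of
 * possible CPUs, clamped to the range [1, cap].
 */
unsigned int pick_nr_nodes(unsigned int possible_cpus, unsigned int cap)
{
	if (possible_cpus < 1)
		return 1;
	if (possible_cpus > cap)
		return cap;
	return possible_cpus;
}
```

With the fixed cap, pick_nr_nodes(256, NR_NODES_CAP) yields 128. Passing the CPU count itself as the cap, as in pick_nr_nodes(256, 256), models the uncapped "let it scale" variant.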