On Tue, Jan 02, 2024 at 07:46:32PM +0100, Uladzislau Rezki (Sony) wrote:
> A number of nodes which are used in the alloc/free paths is
> set based on num_possible_cpus() in a system. Please note a
> high limit threshold though is fixed and corresponds to 128
> nodes.

Large CPU count machines are NUMA machines. All of the allocation and
reclaim is NUMA node based, i.e. a pgdat per NUMA node. Shrinkers can
also run in a NUMA-aware mode so that per-node structures can be
reclaimed in the same way that per-node LRU lists are scanned for
reclaim.

Hence I'm left to wonder if it would be better to have a vmalloc area
per pgdat (or sub-node cluster) rather than basing the number on CPU
count and then imposing an arbitrary maximum once we get to 128 CPU
cores. We can have 128 CPU cores in a single socket these days, so not
being able to scale the vmalloc areas beyond a single socket seems
like a bit of a limitation.

Scaling out the vmalloc areas in a NUMA-aware fashion allows the
shrinker to run in NUMA-aware mode, which gets rid of the need for the
global shrinker to loop over every single vmap area on every shrinker
invocation. Only the vm areas on the node that has a memory shortage
need to be scanned and reclaimed; it doesn't need to reclaim
everything globally when a single node runs out of memory.

Yes, this may not give quite as good microbenchmark scalability
results, but being able to locate each vm area in node-local memory,
and have operations on them and vmalloc area reclaim largely isolated
to node-local tasks, will work much better on large multi-socket NUMA
machines.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx