On Mon, Jan 15, 2024 at 08:09:29PM +0100, Uladzislau Rezki wrote:
> > On Tue, Jan 02, 2024 at 07:46:32PM +0100, Uladzislau Rezki (Sony) wrote:
> > > A number of nodes which are used in the alloc/free paths is
> > > set based on num_possible_cpus() in a system. Please note a
> > > high limit threshold though is fixed and corresponds to 128
> > > nodes.
> >
> > Large CPU count machines are NUMA machines. All of the allocation
> > and reclaim is NUMA node based i.e. a pgdat per NUMA node.
> >
> > Shrinkers are also able to be run in a NUMA aware mode so that
> > per-node structures can be reclaimed similar to how per-node LRU
> > lists are scanned for reclaim.
> >
> > Hence I'm left to wonder if it would be better to have a vmalloc
> > area per pgdat (or sub-node cluster) rather than just base the
> > number on CPU count and then have an arbitrary maximum number when
> > we get to 128 CPU cores. We can have 128 CPU cores in a single
> > socket these days, so not being able to scale the vmalloc areas
> > beyond a single socket seems like a bit of a limitation.
> >
> > Scaling out the vmalloc areas in a NUMA aware fashion allows the
> > shrinker to be run in NUMA aware mode, which gets rid of the need
> > for the global shrinker to loop over every single vmap area in
> > every shrinker invocation. Only the vm areas on the node that has
> > a memory shortage need to be scanned and reclaimed; it doesn't need
> > to reclaim everything globally when a single node runs out of
> > memory.
> >
> > Yes, this may not give quite as good microbenchmark scalability
> > results, but being able to locate each vm area in node-local memory
> > and have operations on them largely isolated to node-local tasks
> > and vmalloc area reclaim will work much better on large
> > multi-socket NUMA machines.
> >
> Currently i fix the max nodes number to 128. This is because i do not
> have access to such big NUMA systems whereas i do have access to
> around ~128 ones. That is why i have decided to stop at that number
> as of now.

I suspect you are confusing the number of CPUs with the number of NUMA
nodes. A NUMA system with 128 nodes is a large NUMA system that will
have thousands of CPU cores, whilst above you talk about basing the
count on CPU cores and note that a single socket can have 128 cores.

> We can easily set nr_nodes to num_possible_cpus() and let it scale
> for anyone. But before doing this, i would like to give it a try as a
> first step because i have not tested it well on really big NUMA
> systems.

I don't think you need to have large NUMA systems to test it. We have
the "fakenuma" feature for a reason. Essentially, once you have enough
CPU cores that catastrophic lock contention can be generated in a fast
path (it can take as few as 4-5 CPU cores), then you can effectively
test NUMA scalability with fakenuma by creating nodes with >=8 CPUs
each.

This is how I've done testing of NUMA aware algorithms (like
shrinkers!) for the past decade - I haven't had direct access to a big
NUMA machine since 2008, yet it's relatively trivial to test NUMA
based scalability algorithms without them these days.

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
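
For context on the NUMA-aware shrinker interface discussed above, here
is a minimal sketch of the pattern, assuming the pre-6.7
register_shrinker() interface (kernels from 6.7 onwards allocate
shrinkers with shrinker_alloc()/shrinker_register(), but the per-node
callback contract is the same). The demo_* names and the per-node
counters are purely illustrative and are not taken from the vmalloc
patches under discussion: with SHRINKER_NUMA_AWARE set, reclaim invokes
the callbacks once per node with sc->nid filled in, so only that node's
objects need to be counted and scanned.

#include <linux/atomic.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/numa.h>
#include <linux/shrinker.h>

/* Illustrative per-node bookkeeping; not the actual vmalloc structures. */
struct demo_node_cache {
	atomic_long_t nr_objects;
};

static struct demo_node_cache demo_caches[MAX_NUMNODES];

static unsigned long demo_count_objects(struct shrinker *shrink,
					struct shrink_control *sc)
{
	/* Report only what is reclaimable on the node being shrunk. */
	return atomic_long_read(&demo_caches[sc->nid].nr_objects);
}

static unsigned long demo_scan_objects(struct shrinker *shrink,
				       struct shrink_control *sc)
{
	struct demo_node_cache *c = &demo_caches[sc->nid];
	unsigned long freed = 0;

	/* Stand-in for freeing per-node objects (e.g. purging vmap areas). */
	while (freed < sc->nr_to_scan &&
	       atomic_long_add_unless(&c->nr_objects, -1, 0))
		freed++;

	return freed;
}

static struct shrinker demo_shrinker = {
	.count_objects	= demo_count_objects,
	.scan_objects	= demo_scan_objects,
	.seeks		= DEFAULT_SEEKS,
	.flags		= SHRINKER_NUMA_AWARE,
};

static int __init demo_init(void)
{
	/* Pre-6.7 registration; newer kernels use shrinker_alloc(). */
	return register_shrinker(&demo_shrinker, "demo-numa-aware");
}
module_init(demo_init);
MODULE_LICENSE("GPL");

The fakenuma testing mentioned at the end relies on x86's NUMA
emulation (CONFIG_NUMA_EMU): booting with e.g. numa=fake=8 splits the
machine into eight emulated nodes, which is enough to exercise
per-node paths like the one sketched above without access to a real
multi-socket system.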