On Wed, Jan 17, 2024 at 09:06:02AM +1100, Dave Chinner wrote:
> On Mon, Jan 15, 2024 at 08:09:29PM +0100, Uladzislau Rezki wrote:
> > > On Tue, Jan 02, 2024 at 07:46:32PM +0100, Uladzislau Rezki (Sony) wrote:
> > > > A number of nodes which are used in the alloc/free paths is
> > > > set based on num_possible_cpus() in a system. Please note a
> > > > high limit threshold though is fixed and corresponds to 128
> > > > nodes.
> > >
> > > Large CPU count machines are NUMA machines. All of the allocation
> > > and reclaim is NUMA node based, i.e. a pgdat per NUMA node.
> > >
> > > Shrinkers are also able to be run in a NUMA aware mode so that
> > > per-node structures can be reclaimed similar to how per-node LRU
> > > lists are scanned for reclaim.
> > >
> > > Hence I'm left to wonder if it would be better to have a vmalloc
> > > area per pgdat (or sub-node cluster) rather than just base the
> > > number on CPU count and then have an arbitrary maximum number when
> > > we get to 128 CPU cores. We can have 128 CPU cores in a
> > > single socket these days, so not being able to scale the vmalloc
> > > areas beyond a single socket seems like a bit of a limitation.
> > >
> > > Scaling out the vmalloc areas in a NUMA aware fashion allows the
> > > shrinker to be run in numa aware mode, which gets rid of the need
> > > for the global shrinker to loop over every single vmap area in every
> > > shrinker invocation. Only the vm areas on the node that has a memory
> > > shortage need to be scanned and reclaimed; it doesn't need to reclaim
> > > everything globally when a single node runs out of memory.
> > >
> > > Yes, this may not give quite as good microbenchmark scalability
> > > results, but being able to locate each vm area in node local memory
> > > and have operations on them largely isolated to node-local tasks and
> > > vmalloc area reclaim will work much better on large multi-socket
> > > NUMA machines.
> > >
> > Currently I fix the maximum number of nodes to 128. This is because I do
> > not have access to such big NUMA systems, whereas I do have access to
> > systems with around 128 CPUs. That is why I have decided to stop on that
> > number as of now.
>
> I suspect you are confusing number of CPUs with number of NUMA nodes.
>
I do not think so :)

> A NUMA system with 128 nodes is a large NUMA system that will have
> thousands of CPU cores, whilst above you talk about basing the
> count on CPU cores and that a single socket can have 128 cores?
>
> > We can easily set nr_nodes to num_possible_cpus() and let it scale for
> > anyone. But before doing this, I would like to give it a try as a first
> > step because I have not tested it well on really big NUMA systems.
>
> I don't think you need to have large NUMA systems to test it. We
> have the "fakenuma" feature for a reason. Essentially, once you
> have enough CPU cores that catastrophic lock contention can be
> generated in a fast path (can take as few as 4-5 CPU cores), then
> you can effectively test NUMA scalability with fakenuma by creating
> nodes with >=8 CPUs each.
>
> This is how I've done testing of numa aware algorithms (like
> shrinkers!) for the past decade - I haven't had direct access to a
> big NUMA machine since 2008, yet it's relatively trivial to test
> NUMA based scalability algorithms without them these days.
>
I see your point.
NUMA-aware scalability requires rework, i.e. adding an extra layer that
allows such scaling. And if a socket has 256 CPUs, how do we scale the VAs
inside that node among those CPUs? A minimal sketch of the current sizing
rule follows below.

--
Uladzislau Rezki
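
For reference, a minimal sketch of the node sizing described in the
changelog excerpt above: take the possible CPU count and cap it at a fixed
threshold of 128. The names (MAX_VMAP_NODES, nr_vmap_nodes,
vmap_init_nr_nodes) are illustrative only and are not the actual patch
code.

/*
 * Illustrative sketch, not the patch itself: derive the number of
 * per-node vmap structures from num_possible_cpus(), clamped to a
 * fixed upper bound of 128 nodes.
 */
#include <linux/cpumask.h>
#include <linux/init.h>
#include <linux/minmax.h>

#define MAX_VMAP_NODES	128	/* hypothetical name for the fixed threshold */

static unsigned int nr_vmap_nodes = 1;

static void __init vmap_init_nr_nodes(void)
{
	nr_vmap_nodes = clamp_t(unsigned int, num_possible_cpus(),
				1, MAX_VMAP_NODES);
}

As for the fakenuma suggestion, on x86 that corresponds to booting with
something like numa=fake=8, which splits the machine into eight emulated
nodes and lets the NUMA-aware paths be exercised without real multi-socket
hardware.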