On Fri, Jan 19, 2024 at 08:28:05AM +1100, Dave Chinner wrote:
> On Thu, Jan 18, 2024 at 07:23:47PM +0100, Uladzislau Rezki wrote:
> > On Wed, Jan 17, 2024 at 09:06:02AM +1100, Dave Chinner wrote:
> > > On Mon, Jan 15, 2024 at 08:09:29PM +0100, Uladzislau Rezki wrote:
> > > > We can easily set nr_nodes to num_possible_cpus() and let it scale for
> > > > anyone. But before doing this, I would like to give it a try as a first
> > > > step because I have not tested it well on really big NUMA systems.
> > >
> > > I don't think you need to have large NUMA systems to test it. We
> > > have the "fakenuma" feature for a reason. Essentially, once you
> > > have enough CPU cores that catastrophic lock contention can be
> > > generated in a fast path (can take as few as 4-5 CPU cores), then
> > > you can effectively test NUMA scalability with fakenuma by creating
> > > nodes with >=8 CPUs each.
> > >
> > > This is how I've done testing of NUMA-aware algorithms (like
> > > shrinkers!) for the past decade - I haven't had direct access to a
> > > big NUMA machine since 2008, yet it's relatively trivial to test
> > > NUMA-based scalability algorithms without them these days.
> > >
> > I see your point. NUMA-aware scalability requires reworking, adding an
> > extra layer that allows such scaling.
> >
> > If the socket has 256 CPUs, how do we scale VAs inside that node among
> > those CPUs?
>
> It's called "sub-numa clustering" and is a BIOS option that presents
> large core count CPU packages as multiple NUMA nodes. See:
>
> https://www.intel.com/content/www/us/en/developer/articles/technical/fourth-generation-xeon-scalable-family-overview.html
>
> Essentially, large core count CPUs are a cluster of smaller core
> groups with their own resources and memory controllers. This is how
> they are laid out either on a single die (Intel) or as a collection
> of smaller dies (AMD compute complexes) that are tied together by
> the interconnect between the LLCs and memory controllers. They only
> appear as a "unified" CPU because they are configured that way by
> the BIOS, but can also be configured to actually expose their inner
> non-uniform memory access topology for operating systems and
> application stacks that are NUMA aware (like Linux).
>
> This means a "256 core" CPU would probably present as 16 smaller 16
> core CPUs each with their own L1/2/3 caches and memory controllers.
> IOWs, a single socket appears to the kernel as a 16 node NUMA system
> with 16 cores per node. Most NUMA aware scalability algorithms will
> work just fine with this sort of setup - it's just another set of
> numbers in the NUMA distance table...
>
Thank you for your input. I will go through it to see what we can do
in terms of NUMA awareness with thousands of CPUs in total.

Thanks!

--
Uladzislau Rezki