On Tue, Jun 18, 2024 at 07:54:43PM +0000, Chuck Lever III wrote:
> On Jun 18, 2024, at 3:50 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > 
> > On Tue, 2024-06-18 at 19:39 +0000, Chuck Lever III wrote:
> >> 
> >>> On Jun 18, 2024, at 3:29 PM, Trond Myklebust
> >>> <trondmy@xxxxxxxxxxxxxxx> wrote:
> >>> 
> >>> On Tue, 2024-06-18 at 18:40 +0000, Chuck Lever III wrote:
> >>>> 
> >>>>> On Jun 18, 2024, at 2:32 PM, Trond Myklebust
> >>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
> >>>>> 
> >>>>> I recently back ported Neil's lwq code and sunrpc server
> >>>>> changes to our 5.15.130 based kernel in the hope of improving
> >>>>> the performance for our data servers.
> >>>>> 
> >>>>> Our performance team recently ran a fio workload on a client
> >>>>> that was doing 100% NFSv3 reads in O_DIRECT mode over an RDMA
> >>>>> connection (infiniband) against that resulting server. I've
> >>>>> attached the resulting flame graph from a perf profile run on
> >>>>> the server side.
> >>>>> 
> >>>>> Is anyone else seeing this massive contention for the spin
> >>>>> lock in __lwq_dequeue? As you can see, it appears to be
> >>>>> dwarfing all the other nfsd activity on the system in
> >>>>> question here, being responsible for 45% of all the perf
> >>>>> hits.

Ouch. __lwq_dequeue() runs llist_reverse_order() under a spinlock.
llist_reverse_order() is an O(n) algorithm involving a full-length
linked list traversal. IOWs, it's a worst-case cache miss algorithm
running under a spin lock.

And then consider what happens when enqueue processing is faster
than dequeue processing. This means the depth of the queue grows,
and the ultimate length of the queue is unbounded. Because of the
batch processing nature of lwq - it takes ->new, reverses it and
places it in ->ready - the length of the list that needs reversing
ends up growing with every batch where we enqueue faster than we
dequeue.

Unbounded processing queues are bad even when they have O(1)
behaviour. lwq has O(n) worst case behaviour, and that makes this
even worse...

Regardless, the current lwq could be slightly improved - the
lockless enqueue competes for the same cacheline as the dequeue
serialisation lock:

struct lwq {
	spinlock_t		lock;
	struct llist_node	*ready;		/* entries to be dequeued */
	struct llist_head	new;		/* entries being enqueued */
};

Adding __cacheline_aligned_in_smp to ->new (the enqueue side) might
help reduce this enqueue/dequeue cacheline contention a bit by
separating them onto different cachelines (see the sketch further
below). That will push the point of catastrophic breakdown out a
little bit, but it won't solve the issue of queue-depth-based batch
processing on the dequeue side.

I suspect a lockless ring buffer might be a more scalable solution
for the nfsd...

> >>>> I haven't seen that, but I've been working on other issues.
> >>>> 
> >>>> What's the nfsd thread count on your test server? Have you
> >>>> seen a similar impact on 6.10 kernels?
> >>>> 
> >>> 
> >>> 640 knfsd threads. The machine was a supermicro 2029BT-HNR with
> >>> 2xIntel 6150, 384GB of memory and 6xWDC SN840.
> >>> 
> >>> Unfortunately, the machine was a loaner, so cannot compare to
> >>> 6.10. That's why I was asking if anyone has seen anything
> >>> similar.
> >> 
> >> If this system had more than one NUMA node, then using
> >> svc's "numa pool" mode might have helped.

It's a dual socket machine, so it has at least 2 physical nodes.
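
Coming back to the struct layout point above: a minimal, untested
sketch of what pushing the enqueue side onto its own cacheline might
look like. The name lwq_padded is invented for the illustration, and
the member-level ____cacheline_aligned_in_smp annotation from
<linux/cache.h> is used because that is the variant that applies to
an individual struct field:

#include <linux/cache.h>	/* ____cacheline_aligned_in_smp */
#include <linux/llist.h>
#include <linux/spinlock.h>

/*
 * Sketch only: same fields as the current lwq, but with the lockless
 * enqueue side (->new) placed on its own cacheline so that producers
 * no longer bounce the line holding the dequeue serialisation lock
 * and ->ready.  This only reduces enqueue/dequeue contention; it does
 * nothing about the O(n) llist_reverse_order() batch under the lock.
 */
struct lwq_padded {
	spinlock_t		lock;
	struct llist_node	*ready;	/* entries to be dequeued */

	/* enqueue side on its own cacheline */
	struct llist_head	new ____cacheline_aligned_in_smp;
};

On a !CONFIG_SMP build the annotation compiles away to nothing, so
the layout only changes where the contention can actually occur.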

Of course, the bios has to be configured to expose this as a NUMA
machine and not a "legacy SMP" machine for the OS to know that, but
I'm pretty sure that's been the typical server bios default for
quite a few years now.

Even if it wasn't a dual socket machine, the CPU itself is a NUMA
architecture - just about every server x86-64 CPU sold these days is
a NUMA SOC, and neither core-to-core nor core-to-memory latency is
uniform within a socket any more. Desktop CPUs are well down this
track, too.

Intel exposes the details of the topology within a socket via the
bios option known as "sub-numa clustering". This exposes the full
sub-socket CPU and memory topology to the OS, so it is fully aware
of both the on-chip and off-chip topology.

Using sub-numa clustering means we don't end up with 32+ CPUs in a
single per-socket NUMA node. NUMA scalability algorithms largely
rely on keeping the cores-per-numa-node ratio in check, and sub-numa
clustering enables the OS to keep this ratio down to reasonable
levels.

> > Interesting. I had forgotten about that setting.
> > 
> > Just out of curiosity, is there any reason why we might not want
> > to default to that mode on a NUMA-enabled system?
> 
> Can't think of one offhand. Maybe back in the day it was
> hard to tell when you were actually /on/ a NUMA system.

As per above, I would assume that the kernel is *always* running on
a NUMA machine. IOWs, if CONFIG_NUMA is enabled (which it is on just
about every distro kernel these days), then we should be using NUMA
optimisations by default.

If the machine is not a NUMA machine (e.g. single socket, sub-numa
clustering off), then the NUMA subsystem will be initialised with
nr_online_nodes = 1 (i.e. a single active node) and NUMA-aware
algorithms should just behave as if there is a single global node.
If CONFIG_NUMA=n, then nr_online_nodes is hard-coded to 1.

Hence subsystems only need to implement a single algorithm that is
NUMA aware. The bios/system config will tell the kernel how many
nodes there are, and the NUMA algorithms will just do the right
thing because nr_online_nodes=1 in those non-NUMA situations.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
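
To illustrate the nr_online_nodes point above, a minimal sketch (not
code from any of the kernels discussed in this thread) of the
"single NUMA-aware algorithm" setup pattern. The foo_node structure
and foo_init() are invented names; the array is sized by nr_node_ids
because node numbering can be sparse, and memory hotplug and error
unwinding are ignored. On a single-node machine, or with
CONFIG_NUMA=n, the loop simply sets up one node's worth of state, so
no separate non-NUMA code path is needed:

#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/init.h>
#include <linux/list.h>
#include <linux/nodemask.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

/* Invented per-node state for the example. */
struct foo_node {
	spinlock_t		lock;
	struct list_head	work;
};

static struct foo_node **foo_nodes;

/*
 * One NUMA-aware initialisation path.  When nr_online_nodes is 1
 * (single node, or CONFIG_NUMA=n) this degenerates to a single
 * global instance without any special casing.
 */
static int __init foo_init(void)
{
	int nid;

	foo_nodes = kcalloc(nr_node_ids, sizeof(*foo_nodes), GFP_KERNEL);
	if (!foo_nodes)
		return -ENOMEM;

	for_each_online_node(nid) {
		/* allocate each node's state from that node's memory */
		struct foo_node *fn = kzalloc_node(sizeof(*fn),
						   GFP_KERNEL, nid);

		if (!fn)
			return -ENOMEM;	/* unwinding omitted in this sketch */
		spin_lock_init(&fn->lock);
		INIT_LIST_HEAD(&fn->work);
		foo_nodes[nid] = fn;
	}
	return 0;
}

Lookups would then index foo_nodes[] by something like
numa_node_id(), so single-node and multi-node machines go through
exactly the same code path.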