On Wed, 19 Jun 2024, Jeff Layton wrote:
> On Tue, 2024-06-18 at 18:32 +0000, Trond Myklebust wrote:
> > I recently backported Neil's lwq code and sunrpc server changes to
> > our 5.15.130-based kernel in the hope of improving the performance
> > of our data servers.
> >
> > Our performance team recently ran a fio workload on a client doing
> > 100% NFSv3 reads in O_DIRECT mode over an RDMA connection
> > (InfiniBand) against the resulting server. I've attached the flame
> > graph from a perf profile run on the server side.
> >
> > Is anyone else seeing this massive contention for the spin lock in
> > __lwq_dequeue? As you can see, it appears to dwarf all the other
> > nfsd activity on the system in question, being responsible for 45%
> > of all the perf hits.
>
> I haven't spent much time on performance testing since I keep
> getting involved in bugs. It looks like that's just the way lwq
> works. From the comments in lib/lwq.c:
>
> * Entries are dequeued using a spinlock to protect against multiple
> * access. The llist is staged in reverse order, and refreshed
> * from the llist when it exhausts.
> *
> * This is particularly suitable when work items are queued in BH or
> * IRQ context, and where work items are handled one at a time by
> * dedicated threads.
>
> ...we have dedicated threads, but we usually have a lot of them, so
> that lock ends up being pretty contended.
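
For anyone who hasn't read it, the pattern is roughly this - a
simplified sketch of what lib/lwq.c does, leaving out how the real
code closes a race between two threads refilling the ready list at
the same time:

#include <linux/llist.h>
#include <linux/spinlock.h>

struct lwq_sketch {
	spinlock_t		lock;
	struct llist_node	*ready;	/* consumed under the lock */
	struct llist_head	new;	/* filled locklessly */
};

/* producers: lock-free, safe from BH/IRQ context */
static void lwq_sketch_enqueue(struct lwq_sketch *q,
			       struct llist_node *n)
{
	llist_add(n, &q->new);
}

/* consumers: serialized on the spinlock */
static struct llist_node *lwq_sketch_dequeue(struct lwq_sketch *q)
{
	struct llist_node *node;

	spin_lock(&q->lock);	/* <-- the lock in Trond's profile */
	if (!q->ready)
		/* llist_del_all() yields LIFO order; reverse for FIFO */
		q->ready = llist_reverse_order(llist_del_all(&q->new));
	node = q->ready;
	if (node)
		q->ready = node->next;
	spin_unlock(&q->lock);
	return node;
}

Every consumer takes the same spinlock for every dequeue, so with
enough nfsd threads hammering a single queue the contention in the
profile is what you would expect.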

> Is the box you're testing on NUMA-enabled? Setting the server to
> pool_mode=pernode might be worth an experiment. At least you'd have
> more than one lwq and less cross-node chatter. You could also try
> pool_mode=percpu, but that's rumored not to be as helpful.
>
> Maybe we need to consider some other lockless queueing mechanism
> longer term, but I'm not sure how possible that is.

I put a lot of thought into coming up with a lockless dequeue and
failed. I could do it if we had generic load-locked/store-conditional
primitives, but that requires hardware support to be efficient.
Compare-and-exchange cannot do it.

The core of the problem is that dequeue requires concurrent control
of the "head" pointer and the selected entry. cmpxchg can only
control one address at a time.
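
To make that concrete, here is the obvious cmpxchg-based dequeue.
This is a sketch of the failure mode, not anything to use - the two
comments mark the holes:

#include <linux/atomic.h>
#include <linux/llist.h>

/* UNSAFE: the classic lock-free "pop" race */
static struct llist_node *broken_dequeue(struct llist_node **head)
{
	struct llist_node *first, *next;

	do {
		first = READ_ONCE(*head);
		if (!first)
			return NULL;
		/*
		 * Another consumer may already have dequeued 'first'
		 * and freed or re-queued it, so this dereference can
		 * hit freed memory...
		 */
		next = READ_ONCE(first->next);
		/*
		 * ...and even if it doesn't: if 'first' was removed
		 * and re-added with a different next pointer (ABA),
		 * the cmpxchg below still succeeds and installs the
		 * stale 'next', corrupting the list.
		 */
	} while (cmpxchg(head, first, next) != first);
	return first;
}

The cmpxchg guards only the head pointer; nothing guards
'first->next'. LL/SC would fail the store-conditional if anything had
written the head in between, even if the same value was restored,
which is exactly the second point of control that a value-based
cmpxchg cannot give you.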

If the lock is highly contended, then sharding is likely the best
option - across NUMA nodes or across some other subset of CPUs.
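
A sharded version might look something like the shape below - the
names here are invented for illustration, and note that
pool_mode=pernode already gives an equivalent per-node split at the
svc_pool level:

#include <linux/lwq.h>
#include <linux/topology.h>

struct sharded_q {
	struct lwq *per_node;	/* nr_node_ids entries */
};

static void sharded_enqueue(struct sharded_q *sq, struct lwq_node *n)
{
	/* stay node-local so each lock is shared only within a node */
	lwq_enqueue(n, &sq->per_node[numa_node_id()]);
}

A real version would also need some way for idle nodes to steal from
busy ones, so work cannot strand on a node with no free threads -
much the same trade-off that pool_mode=pernode already accepts.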

NeilBrown