On Tue, 2024-06-18 at 18:32 +0000, Trond Myklebust wrote:
> I recently backported Neil's lwq code and sunrpc server changes to
> our 5.15.130-based kernel in the hope of improving the performance
> for our data servers.
>
> Our performance team recently ran a fio workload on a client that
> was doing 100% NFSv3 reads in O_DIRECT mode over an RDMA connection
> (InfiniBand) against the resulting server. I've attached the
> resulting flame graph from a perf profile run on the server side.
>
> Is anyone else seeing this massive contention for the spin lock in
> __lwq_dequeue? As you can see, it appears to dwarf all the other
> nfsd activity on the system in question, accounting for 45% of all
> the perf hits.

I haven't spent much time on performance testing since I keep getting
pulled into bug work, but it looks like that's just the way lwq works.
From the comments in lib/lwq.c:

 * Entries are dequeued using a spinlock to protect against multiple
 * access. The llist is staged in reverse order, and refreshed
 * from the llist when it exhausts.
 *
 * This is particularly suitable when work items are queued in BH or
 * IRQ context, and where work items are handled one at a time by
 * dedicated threads.

...we have dedicated threads, but we usually have a lot of them, so
that lock ends up being pretty contended (see the sketch below).

Is the box you're testing on NUMA-enabled? Setting the server to
pool_mode=pernode might be worth an experiment. At least you'd have
more than one lwq and less cross-node chatter. You could also try
pool_mode=percpu, but that's rumored not to be as helpful.

Maybe we need to consider some other lockless queueing mechanism
longer term, but I'm not sure how possible that is.
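To make the contention point concrete, here is a condensed sketch of
the pattern lwq uses. This is paraphrased from lib/lwq.c from memory;
the struct and function names here are mine and the details (memory
barriers, wakeups, etc.) are trimmed, so treat it as an illustration
rather than the real code:

    #include <linux/llist.h>
    #include <linux/spinlock.h>

    /* Condensed stand-in for struct lwq, for illustration only. */
    struct lwq_sketch {
            spinlock_t              lock;   /* serializes all dequeuers */
            struct llist_node       *ready; /* staged entries, FIFO order */
            struct llist_head       new;    /* lockless producer side, LIFO */
    };

    /* Producers never take the lock; llist_add() is a cmpxchg loop. */
    static void lwq_sketch_enqueue(struct lwq_sketch *q,
                                   struct llist_node *n)
    {
            llist_add(n, &q->new);
    }

    /*
     * Every consumer funnels through q->lock, both to pop the next
     * staged entry and to refill the stage (reversed back into FIFO
     * order) when it runs dry.
     */
    static struct llist_node *lwq_sketch_dequeue(struct lwq_sketch *q)
    {
            struct llist_node *n;

            spin_lock(&q->lock);
            n = q->ready;
            if (!n && !llist_empty(&q->new))
                    n = llist_reverse_order(llist_del_all(&q->new));
            if (n)
                    q->ready = n->next;
            spin_unlock(&q->lock);
            return n;
    }

The enqueue side is cheap and lock-free, but with dozens of nfsd
threads sharing one pool, every dequeue hits that one spinlock's
cacheline, which is consistent with __lwq_dequeue dominating your
profile.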
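For anyone who wants to run the pool_mode experiment: it's a sunrpc
module parameter, and IIRC it can only be changed while nfsd is not
running, so something like this (assuming a systemd-managed server;
adjust the stop/start commands for your setup):

    # switch the thread pools to one-per-NUMA-node
    systemctl stop nfs-server
    echo pernode > /sys/module/sunrpc/parameters/pool_mode
    systemctl start nfs-server

    # or persistently, in /etc/modprobe.d/sunrpc.conf:
    #   options sunrpc pool_mode=pernode

Afterward, /proc/fs/nfsd/pool_stats should show one line per pool, so
you can confirm that you really do have more than one lwq.

-- 
Jeff Layton <jlayton@xxxxxxxxxx>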