On Tue, 2024-06-18 at 18:32 +0000, Trond Myklebust wrote:
> I recently backported Neil's lwq code and sunrpc server changes to
> our 5.15.130-based kernel in the hope of improving the performance
> for our data servers.
>
> Our performance team recently ran a fio workload on a client that
> was doing 100% NFSv3 reads in O_DIRECT mode over an RDMA connection
> (InfiniBand) against the resulting server. I've attached the
> resulting flame graph from a perf profile run on the server side.
>
> Is anyone else seeing this massive contention for the spin lock in
> __lwq_dequeue? As you can see, it appears to dwarf all the other
> nfsd activity on the system in question, accounting for 45% of all
> the perf hits.

I haven't spent much time on performance testing since I keep getting
pulled into bug work, but it looks like that's just the way lwq works.
From the comments in lib/lwq.c:

 * Entries are dequeued using a spinlock to protect against multiple
 * access. The llist is staged in reverse order, and refreshed
 * from the llist when it exhausts.
 *
 * This is particularly suitable when work items are queued in BH or
 * IRQ context, and where work items are handled one at a time by
 * dedicated threads.

...we have dedicated threads, but we usually have a lot of them, so
that lock ends up being pretty contended (see the sketch below).

Is the box you're testing on NUMA-enabled? Setting the server to
pool_mode=pernode might be worth an experiment. At least you'd have
more than one lwq and less cross-node chatter. You could also try
pool_mode=percpu, but that's rumored not to be as helpful.

Maybe we need to consider some other lockless queueing mechanism
longer term, but I'm not sure how possible that is.
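To make the contention point concrete, here is a condensed sketch of
the pattern lwq uses. This is paraphrased from lib/lwq.c from memory;
the struct and function names here are mine and the details (memory
barriers, wakeups, etc.) are trimmed, so treat it as an illustration
rather than the real code:

    #include <linux/llist.h>
    #include <linux/spinlock.h>

    /* Condensed stand-in for struct lwq, for illustration only. */
    struct lwq_sketch {
            spinlock_t              lock;   /* serializes all dequeuers */
            struct llist_node       *ready; /* staged entries, FIFO order */
            struct llist_head       new;    /* lockless producer side, LIFO */
    };

    /* Producers never take the lock; llist_add() is a cmpxchg loop. */
    static void lwq_sketch_enqueue(struct lwq_sketch *q,
                                   struct llist_node *n)
    {
            llist_add(n, &q->new);
    }

    /*
     * Every consumer funnels through q->lock, both to pop the next
     * staged entry and to refill the stage (reversed back into FIFO
     * order) when it runs dry.
     */
    static struct llist_node *lwq_sketch_dequeue(struct lwq_sketch *q)
    {
            struct llist_node *n;

            spin_lock(&q->lock);
            n = q->ready;
            if (!n && !llist_empty(&q->new))
                    n = llist_reverse_order(llist_del_all(&q->new));
            if (n)
                    q->ready = n->next;
            spin_unlock(&q->lock);
            return n;
    }

The enqueue side is cheap and lock-free, but with dozens of nfsd
threads sharing one pool, every dequeue hits that one spinlock's
cacheline, which is consistent with __lwq_dequeue dominating your
profile.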
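For anyone who wants to run the pool_mode experiment: it's a sunrpc
module parameter, and IIRC it can only be changed while nfsd is not
running, so something like this (assuming a systemd-managed server;
adjust the stop/start commands for your setup):

    # switch the thread pools to one-per-NUMA-node
    systemctl stop nfs-server
    echo pernode > /sys/module/sunrpc/parameters/pool_mode
    systemctl start nfs-server

    # or persistently, in /etc/modprobe.d/sunrpc.conf:
    #   options sunrpc pool_mode=pernode

Afterward, /proc/fs/nfsd/pool_stats should show one line per pool, so
you can confirm that you really do have more than one lwq.

-- 
Jeff Layton <jlayton@xxxxxxxxxx>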