On Wed, 19 Jun 2024, Jeff Layton wrote:
> On Tue, 2024-06-18 at 18:32 +0000, Trond Myklebust wrote:
> > I recently backported Neil's lwq code and sunrpc server changes to
> > our 5.15.130-based kernel in the hope of improving the performance
> > of our data servers.
> >
> > Our performance team recently ran a fio workload on a client doing
> > 100% NFSv3 reads in O_DIRECT mode over an RDMA connection
> > (InfiniBand) against the resulting server. I've attached the flame
> > graph from a perf profile run on the server side.
> >
> > Is anyone else seeing this massive contention for the spin lock in
> > __lwq_dequeue? As you can see, it appears to dwarf all the other
> > nfsd activity on the system in question, being responsible for 45%
> > of all the perf hits.
>
> I haven't spent much time on performance testing since I keep
> getting involved in bugs. It looks like that's just the way lwq
> works. From the comments in lib/lwq.c:
>
> * Entries are dequeued using a spinlock to protect against multiple
> * access. The llist is staged in reverse order, and refreshed
> * from the llist when it exhausts.
> *
> * This is particularly suitable when work items are queued in BH or
> * IRQ context, and where work items are handled one at a time by
> * dedicated threads.
>
> ...we have dedicated threads, but we usually have a lot of them, so
> that lock ends up being pretty contended.
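
For anyone who hasn't read it, the pattern is roughly this - a
simplified sketch of what lib/lwq.c does, leaving out how the real
code closes a race between two threads refilling the ready list at
the same time:

#include <linux/llist.h>
#include <linux/spinlock.h>

struct lwq_sketch {
	spinlock_t		lock;
	struct llist_node	*ready;	/* consumed under the lock */
	struct llist_head	new;	/* filled locklessly */
};

/* producers: lock-free, safe from BH/IRQ context */
static void lwq_sketch_enqueue(struct lwq_sketch *q,
			       struct llist_node *n)
{
	llist_add(n, &q->new);
}

/* consumers: serialized on the spinlock */
static struct llist_node *lwq_sketch_dequeue(struct lwq_sketch *q)
{
	struct llist_node *node;

	spin_lock(&q->lock);	/* <-- the lock in Trond's profile */
	if (!q->ready)
		/* llist_del_all() yields LIFO order; reverse for FIFO */
		q->ready = llist_reverse_order(llist_del_all(&q->new));
	node = q->ready;
	if (node)
		q->ready = node->next;
	spin_unlock(&q->lock);
	return node;
}

Every consumer takes the same spinlock for every dequeue, so with
enough nfsd threads hammering a single queue the contention in the
profile is what you would expect.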

> Is the box you're testing on NUMA-enabled? Setting the server to
> pool_mode=pernode might be worth an experiment. At least you'd have
> more than one lwq and less cross-node chatter. You could also try
> pool_mode=percpu, but that's rumored not to be as helpful.
>
> Maybe we need to consider some other lockless queueing mechanism
> longer term, but I'm not sure how possible that is.

I put a lot of thought into coming up with a lockless dequeue and
failed. I could do it if we had generic load-locked/store-conditional
primitives, but that requires hardware support to be efficient.
Compare-and-exchange cannot do it.

The core of the problem is that dequeue requires concurrent control
of the "head" pointer and the selected entry. cmpxchg can only
control one address at a time.
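
To make that concrete, here is the obvious cmpxchg-based dequeue.
This is a sketch of the failure mode, not anything to use - the two
comments mark the holes:

#include <linux/atomic.h>
#include <linux/llist.h>

/* UNSAFE: the classic lock-free "pop" race */
static struct llist_node *broken_dequeue(struct llist_node **head)
{
	struct llist_node *first, *next;

	do {
		first = READ_ONCE(*head);
		if (!first)
			return NULL;
		/*
		 * Another consumer may already have dequeued 'first'
		 * and freed or re-queued it, so this dereference can
		 * hit freed memory...
		 */
		next = READ_ONCE(first->next);
		/*
		 * ...and even if it doesn't: if 'first' was removed
		 * and re-added with a different next pointer (ABA),
		 * the cmpxchg below still succeeds and installs the
		 * stale 'next', corrupting the list.
		 */
	} while (cmpxchg(head, first, next) != first);
	return first;
}

The cmpxchg guards only the head pointer; nothing guards
'first->next'. LL/SC would fail the store-conditional if anything had
written the head in between, even if the same value was restored,
which is exactly the second point of control that a value-based
cmpxchg cannot give you.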

If the lock is highly contended, then sharding is likely the best
option - across NUMA nodes or across some other subset of CPUs.
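
A sharded version might look something like the shape below - the
names here are invented for illustration, and note that
pool_mode=pernode already gives an equivalent per-node split at the
svc_pool level:

#include <linux/lwq.h>
#include <linux/topology.h>

struct sharded_q {
	struct lwq *per_node;	/* nr_node_ids entries */
};

static void sharded_enqueue(struct sharded_q *sq, struct lwq_node *n)
{
	/* stay node-local so each lock is shared only within a node */
	lwq_enqueue(n, &sq->per_node[numa_node_id()]);
}

A real version would also need some way for idle nodes to steal from
busy ones, so work cannot strand on a node with no free threads -
much the same trade-off that pool_mode=pernode already accepts.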

NeilBrown