On Wed, 19 Jun 2024, Dave Chinner wrote:
> On Tue, Jun 18, 2024 at 07:54:43PM +0000, Chuck Lever III wrote:
> On Jun 18, 2024, at 3:50 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> >
> >
> > On Tue, 2024-06-18 at 19:39 +0000, Chuck Lever III wrote:
> > >>
> > >>
> > >>> On Jun 18, 2024, at 3:29 PM, Trond Myklebust
> > >>> <trondmy@xxxxxxxxxxxxxxx> wrote:
> > >>>
> > >>> On Tue, 2024-06-18 at 18:40 +0000, Chuck Lever III wrote:
> > >>>>
> > >>>>
> > >>>>> On Jun 18, 2024, at 2:32 PM, Trond Myklebust
> > >>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
> > >>>>>
> > >>>>> I recently backported Neil's lwq code and sunrpc server
> > >>>>> changes to our 5.15.130 based kernel in the hope of improving
> > >>>>> the performance for our data servers.
> > >>>>>
> > >>>>> Our performance team recently ran a fio workload on a client
> > >>>>> that was doing 100% NFSv3 reads in O_DIRECT mode over an RDMA
> > >>>>> connection (infiniband) against the resulting server. I've
> > >>>>> attached the resulting flame graph from a perf profile run on
> > >>>>> the server side.
> > >>>>>
> > >>>>> Is anyone else seeing this massive contention for the spin lock
> > >>>>> in __lwq_dequeue? As you can see, it appears to be dwarfing all
> > >>>>> the other nfsd activity on the system in question here, being
> > >>>>> responsible for 45% of all the perf hits.
>
> Ouch. __lwq_dequeue() runs llist_reverse_order() under a spinlock.
>
> llist_reverse_order() is an O(n) algorithm involving a full-length
> linked list traversal. IOWs, it's a worst case cache miss algorithm
> running under a spin lock. And then consider what happens when
> enqueue processing is faster than dequeue processing.

My expectation was that if enqueue processing (incoming packets) was
faster than dequeue processing (handling NFS requests), then the
bottleneck would be elsewhere and this one wouldn't be relevant.
It might be useful to measure how long the queue gets.

>
> This means the depth of the queue grows, and the ultimate length of
> the queue is unbounded. Because of the batch processing nature of lwq -
> it takes ->new, reverses it and places it in ->ready - the length of
> the list that needs reversing ends up growing with every batch whenever
> we queue faster than we dequeue. Unbounded processing queues are bad
> even when they have O(1) behaviour. lwq has O(n) worst case
> behaviour, and that makes this even worse...
>
> Regardless, the current lwq could be slightly improved - the
> lockless enqueue competes for the same cacheline as the dequeue
> serialisation lock.
>
> struct lwq {
> 	spinlock_t		lock;
> 	struct llist_node	*ready;		/* entries to be dequeued */
> 	struct llist_head	new;		/* entries being enqueued */
> };
>
> Adding __cacheline_aligned_in_smp to ->new (the enqueue side) might
> help reduce this enqueue/dequeue cacheline contention a bit by
> separating them onto different cachelines. That will push the point
> of catastrophic breakdown out a little bit, but not solve the issue
> of queue-depth-based batch processing on the dequeue side.
>

Yes, that might be beneficial - thanks.

> I suspect a lockless ring buffer might be a more scalable solution
> for the nfsd...

We would need to size the buffer to the maximum number of connections
(if there is one - I don't recall) or resize the buffer as the number
of connections grows.  Certainly doable.

Thanks,
NeilBrown
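
For anyone reading along who hasn't looked at lwq: below is a minimal
sketch, in the spirit of lib/lwq.c, of the dequeue pattern Dave is
describing. It is deliberately simplified (the real code's sentinel
handling and the surrounding svc wakeup logic are omitted, and the
function name is made up), so treat it as an illustration rather than
the upstream implementation. The point is that llist_reverse_order()
walks the whole accumulated batch while ->lock is held:

#include <linux/llist.h>
#include <linux/spinlock.h>

struct lwq {
	spinlock_t		lock;
	struct llist_node	*ready;		/* entries to be dequeued */
	struct llist_head	new;		/* entries being enqueued */
};

/*
 * Simplified dequeue sketch: when ->ready is empty, take everything
 * that was enqueued locklessly onto ->new and reverse it into FIFO
 * order.  The reversal is O(n) in the batch size and runs with ->lock
 * held, which is the contention Trond's profile is showing.
 */
static struct llist_node *lwq_dequeue_sketch(struct lwq *q)
{
	struct llist_node *node;

	spin_lock(&q->lock);
	if (!q->ready && !llist_empty(&q->new))
		q->ready = llist_reverse_order(llist_del_all(&q->new));
	node = q->ready;
	if (node)
		q->ready = node->next;
	spin_unlock(&q->lock);
	return node;
}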
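
On "measure how long the queue gets": a hypothetical debug helper along
these lines (not part of the lwq API; the name is made up) could report
the backlog on the enqueue side, i.e. the size of the next batch that
will have to be reversed under ->lock. Entries added after ->new.first
is read are missed, so the result is only an estimate:

#include <linux/llist.h>
#include <linux/lwq.h>
#include <linux/spinlock.h>

/*
 * Hypothetical debug helper: estimate how many entries are waiting on
 * the lockless enqueue side.  Holding ->lock keeps __lwq_dequeue() from
 * taking and reversing the list underneath us; concurrent enqueues only
 * push new heads, so the chain we snapshot stays stable while we walk.
 * ->ready is deliberately not counted, to keep the sketch independent
 * of lwq's internal sentinel handling.
 */
static int lwq_backlog_estimate(struct lwq *q)
{
	struct llist_node *n;
	int len = 0;

	spin_lock(&q->lock);
	for (n = q->new.first; n; n = n->next)
		len++;
	spin_unlock(&q->lock);
	return len;
}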
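
Dave's cacheline suggestion would look something like this; it is just
the obvious application of his comment, not a tested patch. For a
struct member the annotation is spelled ____cacheline_aligned_in_smp,
and it pushes the lockless enqueue head onto its own cacheline so
producers stop bouncing the line that holds ->lock and ->ready:

struct lwq {
	spinlock_t		lock;
	struct llist_node	*ready;		/* entries to be dequeued */

	/* enqueue side on its own cacheline, away from ->lock/->ready */
	struct llist_head	new ____cacheline_aligned_in_smp;
						/* entries being enqueued */
};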
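
On the ring buffer idea and the sizing question: the kernel's existing
ptr_ring (include/linux/ptr_ring.h) is one data point for how sizing
and resizing could be handled, though it is not lockless - it uses
separate producer and consumer spinlocks - so this is only a sketch of
the sizing/resizing concern, not the structure Dave has in mind. The
xprt_ring wrapper and its functions are made-up names for illustration:

#include <linux/gfp.h>
#include <linux/ptr_ring.h>

/* Hypothetical wrapper: a request queue sized to a connection count. */
struct xprt_ring {
	struct ptr_ring ring;
};

static int xprt_ring_init(struct xprt_ring *xr, int max_connections)
{
	/* one slot per connection that can have a request pending */
	return ptr_ring_init(&xr->ring, max_connections, GFP_KERNEL);
}

/* Grow the ring if the number of connections grows. */
static int xprt_ring_grow(struct xprt_ring *xr, int new_size)
{
	return ptr_ring_resize(&xr->ring, new_size, GFP_KERNEL, NULL);
}

static int xprt_ring_enqueue(struct xprt_ring *xr, void *req)
{
	return ptr_ring_produce(&xr->ring, req);	/* -ENOSPC if full */
}

static void *xprt_ring_dequeue(struct xprt_ring *xr)
{
	return ptr_ring_consume(&xr->ring);		/* NULL when empty */
}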