Re: knfsd performance

On Thu, Jun 20, 2024 at 07:25:15AM +1000, NeilBrown wrote:
> On Wed, 19 Jun 2024, NeilBrown wrote:
> > On Wed, 19 Jun 2024, Dave Chinner wrote:
> > > On Tue, Jun 18, 2024 at 07:54:43PM +0000, Chuck Lever III wrote:
> > > > On Jun 18, 2024, at 3:50 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > 
> > > > > On Tue, 2024-06-18 at 19:39 +0000, Chuck Lever III wrote:
> > > > >> 
> > > > >> 
> > > > >>> On Jun 18, 2024, at 3:29 PM, Trond Myklebust
> > > > >>> <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > >>> 
> > > > >>> On Tue, 2024-06-18 at 18:40 +0000, Chuck Lever III wrote:
> > > > >>>> 
> > > > >>>> 
> > > > >>>>> On Jun 18, 2024, at 2:32 PM, Trond Myklebust
> > > > >>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > >>>>> 
> > > > >>>>> I recently backported Neil's lwq code and sunrpc server
> > > > >>>>> changes to our 5.15.130-based kernel in the hope of
> > > > >>>>> improving the performance of our data servers.
> > > > >>>>> 
> > > > >>>>> Our performance team recently ran a fio workload on a
> > > > >>>>> client doing 100% NFSv3 reads in O_DIRECT mode over an
> > > > >>>>> RDMA connection (InfiniBand) against that resulting
> > > > >>>>> server. I've attached the resulting flame graph from a
> > > > >>>>> perf profile run on the server side.
> > > > >>>>> 
> > > > >>>>> Is anyone else seeing this massive contention for the
> > > > >>>>> spin lock in __lwq_dequeue? As you can see, it appears
> > > > >>>>> to be dwarfing all the other nfsd activity on the system
> > > > >>>>> in question here, being responsible for 45% of all the
> > > > >>>>> perf hits.
> > > 
> > > Ouch. __lwq_dequeue() runs llist_reverse_order() under a spinlock.
> > > 
> > > llist_reverse_order() is an O(n) algorithm involving full length
> > > linked list traversal. IOWs, it's a worst case cache miss algorithm
> > > running under a spin lock. And then consider what happens when
> > > enqueue processing is faster than dequeue processing.
> > 
> > My expectation was that if enqueue processing (incoming packets) was
> > faster than dequeue processing (handling NFS requests) then there was a
> > bottleneck elsewhere, and this one wouldn't be relevant.
> > 
> > It might be useful to measure how long the queue gets.
> 
> Thinking about this some more ....  if it did turn out that the queue
> gets long, and maybe even if it didn't, we could reimplement lwq as a
> simple linked list with head and tail pointers.
> 
> enqueue would be something like:
> 
>   new->next = NULL;
>   old_tail = xchg(&q->tail, new);
>   if (old_tail)
>        /* dequeue of old_tail cannot succeed until this assignment completes */
>        old_tail->next = new;
>   else
>        q->head = new;
>
> dequeue would be
> 
>   spin_lock();
>   ret = q->head;
>   if (ret) {
>         while (ret->next == NULL && cmp_xchg(&q->tail, ret, NULL) != ret)
>             /* wait for enqueue of q->tail to complete */
>             cpu_relax();
>         cmp_xchg(&q->head, ret, ret->next);
>   }
>   spin_unlock();

That might work, but I suspect that it's still only putting off the
inevitable.

Doing the dequeue purely with atomic operations might be possible,
but it's not immediately obvious to me how to solve both head/tail
race conditions with atomic operations. I can work out an algorithm
that makes enqueue safe against dequeue races (or vice versa), but I
can't get the logic on the opposite side to be safe at the same time.

I'll let it bounce around my head a bit more...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
