Re: knfsd performance

On Fri, 21 Jun 2024, Chuck Lever III wrote:
> 
> 
> > On Jun 19, 2024, at 10:29 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > 
> > On Thu, Jun 20, 2024 at 07:25:15AM +1000, NeilBrown wrote:
> >> On Wed, 19 Jun 2024, NeilBrown wrote:
> >>> On Wed, 19 Jun 2024, Dave Chinner wrote:
> >>>> On Tue, Jun 18, 2024 at 07:54:43PM +0000, Chuck Lever III wrote:
> >>>>> On Jun 18, 2024, at 3:50 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> >>>>>> 
> >>>>>> On Tue, 2024-06-18 at 19:39 +0000, Chuck Lever III wrote:
> >>>>>>> 
> >>>>>>> 
> >>>>>>>> On Jun 18, 2024, at 3:29 PM, Trond Myklebust
> >>>>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
> >>>>>>>> 
> >>>>>>>> On Tue, 2024-06-18 at 18:40 +0000, Chuck Lever III wrote:
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>>> On Jun 18, 2024, at 2:32 PM, Trond Myklebust
> >>>>>>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
> >>>>>>>>>> 
> >>>>>>>>>> I recently backported Neil's lwq code and sunrpc server changes
> >>>>>>>>>> to our 5.15.130 based kernel in the hope of improving the
> >>>>>>>>>> performance for our data servers.
> >>>>>>>>>> 
> >>>>>>>>>> Our performance team recently ran a fio workload on a client
> >>>>>>>>>> that was doing 100% NFSv3 reads in O_DIRECT mode over an RDMA
> >>>>>>>>>> connection (infiniband) against that resulting server. I've
> >>>>>>>>>> attached the resulting flame graph from a perf profile run on
> >>>>>>>>>> the server side.
> >>>>>>>>>> 
> >>>>>>>>>> Is anyone else seeing this massive contention for the spin lock
> >>>>>>>>>> in __lwq_dequeue? As you can see, it appears to be dwarfing all
> >>>>>>>>>> the other nfsd activity on the system in question here, being
> >>>>>>>>>> responsible for 45% of all the perf hits.
> >>>> 
> >>>> Ouch. __lwq_dequeue() runs llist_reverse_order() under a spinlock.
> >>>> 
> >>>> llist_reverse_order() is an O(n) algorithm involving full length
> >>>> linked list traversal. IOWs, it's a worst case cache miss algorithm
> >>>> running under a spin lock. And then consider what happens when
> >>>> enqueue processing is faster than dequeue processing.
> >>> 
> >>> My expectation was that if enqueue processing (incoming packets) was
> >>> faster than dequeue processing (handling NFS requests) then there was a
> >>> bottleneck elsewhere, and this one wouldn't be relevant.
> >>> 
> >>> It might be useful to measure how long the queue gets.
> >> 
> >> Thinking about this some more ....  if it did turn out that the queue
> >> gets long, and maybe even if it didn't, we could reimplement lwq as a
> >> simple linked list with head and tail pointers.
> >> 
> >> enqueue would be something like:
> >> 
> >>  new->next = NULL;
> >>  old_tail = xchg(&q->tail, new);
> >>  if (old_tail)
> >>       /* dequeue of old_tail cannot succeed until this assignment completes */
> >>       old_tail->next = new;
> >>  else
> >>       q->head = new;
> >> 
> >> dequeue would be
> >> 
> >>  spin_lock();
> >>  ret = q->head;
> >>  if (ret) {
> >>        while (ret->next == NULL && cmp_xchg(&q->tail, ret, NULL) != ret)
> >>            /* wait for enqueue of q->tail to complete */
> >>            cpu_relax();
> >>        cmp_xchg(&q->head, ret, ret->next);
> >>  }
> >>  spin_unlock();
> > 
> > That might work, but I suspect that it's still only putting off the
> > inevitable.
> > 
> > Doing the dequeue purely with atomic operations might be possible,
> > but it's not immediately obvious to me how to solve both head/tail
> > race conditions with atomic operations. I can work out an algorithm
> > that makes enqueue safe against dequeue races (or vice versa), but I
> > can't also get the logic on the opposite side to also be safe.
> > 
> > I'll let it bounce around my head a bit more...
> 
> I agree that O(n) dequeuing is potentially alarming.

Only O(n) on 1/n of calls, so it is still amortised constant time.

> 
> Before we go too far down this path, I'd like to see
> reproducible numbers that show there is a problem
> when a recent upstream NFS server is properly set up
> with a sensible number of threads and running a real
> workload.

Question for Trond: was nconnect configured, or was there only a single
connection?

With a single connection there is only ever zero or one xprt in the
queue to be dequeued, and if there are zero we don't take the lock.

With 16 connections they might always be busy so as soon as a request is
read from the connection it is requeued.  This means 1/16 of dequeue
operations would be slowish and the other 15/16 would be fast.

Maybe the 1/16 slow case could swamp the others but I'd be surprised.


> 
> Otherwise there is a risk of introducing code in a
> fundamental part of SunRPC that is optimized to the
> point of brittleness, and for no good reason.
> 
> This is what keeps me from sleeping at night: [1]
> See, it even has my name on it. :-)
> 
> 
> --
> Chuck Lever
> 
> [1] - https://www.linusakesson.net/programming/kernighans-lever/index.php

The conclusion of that article is that we SHOULD try to write clever
code because the effort invested in writing it and then debugging it
makes us cleverer so that the next time we can do even better.  That
thought would help me sleep at night!

NeilBrown
