On Wed, 19 Jun 2024, NeilBrown wrote:
> On Wed, 19 Jun 2024, Dave Chinner wrote:
> > On Tue, Jun 18, 2024 at 07:54:43PM +0000, Chuck Lever III wrote:
> > > > On Jun 18, 2024, at 3:50 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > >
> > > > On Tue, 2024-06-18 at 19:39 +0000, Chuck Lever III wrote:
> > > >>> On Jun 18, 2024, at 3:29 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > >>>
> > > >>> On Tue, 2024-06-18 at 18:40 +0000, Chuck Lever III wrote:
> > > >>>>> On Jun 18, 2024, at 2:32 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > >>>>>
> > > >>>>> I recently back ported Neil's lwq code and sunrpc server changes to
> > > >>>>> our 5.15.130 based kernel in the hope of improving the performance
> > > >>>>> for our data servers.
> > > >>>>>
> > > >>>>> Our performance team recently ran a fio workload on a client that
> > > >>>>> was doing 100% NFSv3 reads in O_DIRECT mode over an RDMA connection
> > > >>>>> (infiniband) against that resulting server. I've attached the
> > > >>>>> resulting flame graph from a perf profile run on the server side.
> > > >>>>>
> > > >>>>> Is anyone else seeing this massive contention for the spin lock in
> > > >>>>> __lwq_dequeue? As you can see, it appears to be dwarfing all the
> > > >>>>> other nfsd activity on the system in question here, being
> > > >>>>> responsible for 45% of all the perf hits.
> >
> > Ouch. __lwq_dequeue() runs llist_reverse_order() under a spinlock.
> >
> > llist_reverse_order() is an O(n) algorithm involving full length
> > linked list traversal. IOWs, it's a worst case cache miss algorithm
> > running under a spin lock. And then consider what happens when
> > enqueue processing is faster than dequeue processing.
>
> My expectation was that if enqueue processing (incoming packets) was
> faster than dequeue processing (handling NFS requests) then there was a
> bottleneck elsewhere, and this one wouldn't be relevant.
>
> It might be useful to measure how long the queue gets.

Thinking about this some more .... if it did turn out that the queue
gets long, and maybe even if it didn't, we could reimplement lwq as a
simple linked list with head and tail pointers.

enqueue would be something like:

	new->next = NULL;
	old_tail = xchg(&q->tail, new);
	if (old_tail)
		/* dequeue of old_tail cannot succeed until this assignment completes */
		old_tail->next = new;
	else
		q->head = new;

dequeue would be:

	spin_lock();
	ret = q->head;
	if (ret) {
		while (ret->next == NULL &&
		       cmpxchg(&q->tail, ret, NULL) != ret)
			/* wait for enqueue of q->tail to complete */
			cpu_relax();
		/* fails harmlessly if an enqueue re-filled an emptied queue */
		cmpxchg(&q->head, ret, ret->next);
	}
	spin_unlock();

plus some barriers and a bit more careful thought.

That would put a stronger limit on the time that the spinlock was held,
though it could still have to wait for another CPU occasionally.

NeilBrown