Re: knfsd performance

On Thu, 2024-06-20 at 12:29 +1000, Dave Chinner wrote:
> On Thu, Jun 20, 2024 at 07:25:15AM +1000, NeilBrown wrote:
> > On Wed, 19 Jun 2024, NeilBrown wrote:
> > > On Wed, 19 Jun 2024, Dave Chinner wrote:
> > > > > On Tue, Jun 18, 2024 at 07:54:43PM +0000, Chuck Lever III wrote:
> > > > > > On Jun 18, 2024, at 3:50 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > > 
> > > > > > On Tue, 2024-06-18 at 19:39 +0000, Chuck Lever III wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > > On Jun 18, 2024, at 3:29 PM, Trond Myklebust
> > > > > > > > <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > > > > 
> > > > > > > > On Tue, 2024-06-18 at 18:40 +0000, Chuck Lever III wrote:
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > On Jun 18, 2024, at 2:32 PM, Trond Myklebust
> > > > > > > > > > <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > > > > > > 
> > > > > > > > > > I recently backported Neil's lwq code and sunrpc
> > > > > > > > > > server changes to our 5.15.130-based kernel in the
> > > > > > > > > > hope of improving the performance of our data servers.
> > > > > > > > > > 
> > > > > > > > > > Our performance team recently ran a fio workload on a
> > > > > > > > > > client doing 100% NFSv3 reads in O_DIRECT mode over an
> > > > > > > > > > RDMA connection (InfiniBand) against the resulting
> > > > > > > > > > server. I've attached the resulting flame graph from a
> > > > > > > > > > perf profile run on the server side.
> > > > > > > > > > 
> > > > > > > > > > Is anyone else seeing this massive contention for the
> > > > > > > > > > spin lock in __lwq_dequeue? As you can see, it appears
> > > > > > > > > > to be dwarfing all the other nfsd activity on the
> > > > > > > > > > system in question here, being responsible for 45% of
> > > > > > > > > > all the perf hits.
> > > > 
> > > > Ouch. __lwq_dequeue() runs llist_reverse_order() under a spinlock.
> > > > 
> > > > llist_reverse_order() is an O(n) algorithm involving full length
> > > > linked list traversal. IOWs, it's a worst case cache miss algorithm
> > > > running under a spin lock. And then consider what happens when
> > > > enqueue processing is faster than dequeue processing.
> > > 
> > > My expectation was that if enqueue processing (incoming packets) was
> > > faster than dequeue processing (handling NFS requests) then there was a
> > > bottleneck elsewhere, and this one wouldn't be relevant.
> > > 
> > > It might be useful to measure how long the queue gets.
> > 
> > Thinking about this some more ....  if it did turn out that the queue
> > gets long, and maybe even if it didn't, we could reimplement lwq as a
> > simple linked list with head and tail pointers.
> > 
> > enqueue would be something like:
> > 
> >   new->next = NULL;
> >   old_tail = xchg(&q->tail, new);
> >   if (old_tail)
> >        /* dequeue of old_tail cannot succeed until this assignment completes */
> >        old_tail->next = new;
> >   else
> >        q->head = new;
> > 
> > dequeue would be
> > 
> >   spin_lock();
> >   ret = q->head;
> >   if (ret) {
> >         while (ret->next == NULL && cmp_xchg(&q->tail, ret, NULL) != ret)
> >             /* wait for enqueue of q->tail to complete */
> >             cpu_relax();
> >         cmp_xchg(&q->head, ret, ret->next);
> >   }
> >   spin_unlock();
> >   return ret;
> 
> That might work, but I suspect that it's still only putting off the
> inevitable.
> 
> Doing the dequeue purely with atomic operations might be possible,
> but it's not immediately obvious to me how to solve both head/tail
> race conditions with atomic operations. I can work out an algorithm
> that makes enqueue safe against dequeue races (or vice versa), but I
> can't get the logic on the opposite side to be safe as well.
> 
> I'll let it bounce around my head a bit more...
> 

The latest version of the multigrain timestamp patches doesn't use it,
but Jan Kara pointed out to me that there is a cmpxchg128 function in
the kernel. It's only defined for some arches (x86, s390, and aarch64,
I think), but maybe that could be used here.
-- 
Jeff Layton <jlayton@xxxxxxxxxx>




