On Fri, 2024-06-21 at 08:04 +1000, NeilBrown wrote:
> On Fri, 21 Jun 2024, Chuck Lever III wrote:
> > 
> > > On Jun 19, 2024, at 10:29 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > 
> > > On Thu, Jun 20, 2024 at 07:25:15AM +1000, NeilBrown wrote:
> > > > On Wed, 19 Jun 2024, NeilBrown wrote:
> > > > > On Wed, 19 Jun 2024, Dave Chinner wrote:
> > > > > > On Tue, Jun 18, 2024 at 07:54:43PM +0000, Chuck Lever III wrote:
> > > > > > > On Jun 18, 2024, at 3:50 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > > > > 
> > > > > > > > On Tue, 2024-06-18 at 19:39 +0000, Chuck Lever III wrote:
> > > > > > > > > 
> > > > > > > > > On Jun 18, 2024, at 3:29 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > > > > > > 
> > > > > > > > > > On Tue, 2024-06-18 at 18:40 +0000, Chuck Lever III wrote:
> > > > > > > > > > > 
> > > > > > > > > > > On Jun 18, 2024, at 2:32 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > > > > > > > > 
> > > > > > > > > > > > I recently back-ported Neil's lwq code and sunrpc server changes
> > > > > > > > > > > > to our 5.15.130-based kernel in the hope of improving the
> > > > > > > > > > > > performance of our data servers.
> > > > > > > > > > > > 
> > > > > > > > > > > > Our performance team recently ran a fio workload on a client
> > > > > > > > > > > > that was doing 100% NFSv3 reads in O_DIRECT mode over an RDMA
> > > > > > > > > > > > connection (infiniband) against the resulting server.
> > > > > > > > > > > > I've attached the resulting flame graph from a perf profile run
> > > > > > > > > > > > on the server side.
> > > > > > > > > > > > 
> > > > > > > > > > > > Is anyone else seeing this massive contention for the spin lock
> > > > > > > > > > > > in __lwq_dequeue? As you can see, it appears to be dwarfing all
> > > > > > > > > > > > the other nfsd activity on the system in question here, being
> > > > > > > > > > > > responsible for 45% of all the perf hits.
> > > > > > 
> > > > > > Ouch. __lwq_dequeue() runs llist_reverse_order() under a spinlock.
> > > > > > 
> > > > > > llist_reverse_order() is an O(n) algorithm involving a full-length
> > > > > > linked list traversal. IOWs, it's a worst-case cache miss algorithm
> > > > > > running under a spin lock. And then consider what happens when
> > > > > > enqueue processing is faster than dequeue processing.
> > > > > 
> > > > > My expectation was that if enqueue processing (incoming packets) was
> > > > > faster than dequeue processing (handling NFS requests) then there was
> > > > > a bottleneck elsewhere, and this one wouldn't be relevant.
> > > > > 
> > > > > It might be useful to measure how long the queue gets.
> > > > 
> > > > Thinking about this some more .... if it did turn out that the queue
> > > > gets long, and maybe even if it didn't, we could reimplement lwq as a
> > > > simple linked list with head and tail pointers.
> > > > 
> > > > enqueue would be something like:
> > > > 
> > > >	new->next = NULL;
> > > >	old_tail = xchg(&q->tail, new);
> > > >	if (old_tail)
> > > >		/* dequeue of old_tail cannot succeed until this
> > > >		 * assignment completes */
> > > >		old_tail->next = new;
> > > >	else
> > > >		q->head = new;
> > > > 
> > > > dequeue would be:
> > > > 
> > > >	spin_lock();
> > > >	ret = q->head;
> > > >	if (ret) {
> > > >		while (ret->next == NULL &&
> > > >		       cmp_xchg(&q->tail, ret, NULL) != ret)
> > > >			/* wait for enqueue of q->tail to complete */
> > > >			cpu_relax();
> > > >		cmp_xchg(&q->head, ret, ret->next);
> > > >	}
> > > >	spin_unlock();
> > > 
> > > That might work, but I suspect that it's still only putting off the
> > > inevitable.
> > > 
> > > Doing the dequeue purely with atomic operations might be possible,
> > > but it's not immediately obvious to me how to solve both head/tail
> > > race conditions with atomic operations. I can work out an algorithm
> > > that makes enqueue safe against dequeue races (or vice versa), but I
> > > can't get the logic on the opposite side to also be safe.
> > > 
> > > I'll let it bounce around my head a bit more...
> > 
> > I agree that O(n) dequeuing is potentially alarming.
> 
> Only O(n) 1/n of the time. On average it is still constant time.
> 
> > Before we go too far down this path, I'd like to see reproducible
> > numbers that show there is a problem when a recent upstream NFS
> > server is properly set up with a sensible number of threads and
> > running a real workload.

The flame graph is showing contention on a spin lock, so it is affecting
running threads only. It will be 100% insensitive to sleeping threads.

Yes, there is the issue of NUMA and the resulting potential for cache
bouncing between CPUs, and I do agree that that particular detail
warrants a re-test. However, I don't see how the overall number of nfsd
threads is relevant.
Nor do I see how the use of a random read workload in any way
invalidates the test.

> Question for Trond: was nconnect configured, or was there only a
> single connection?
> 
> With a single connection there is only ever zero or one xprt in the
> queue to be dequeued, and if there are zero we don't take the lock.
> 
> With 16 connections they might always be busy, so as soon as a request
> is read from the connection it is requeued. This means 1/16 of dequeue
> operations would be slowish and the other 15/16 would be fast.
> 
> Maybe the 1/16 slow case could swamp the others, but I'd be surprised.

We deliberately coded up commit b326df4a8ec6 ("NFS: enable nconnect for
RDMA") because of the poor performance of RDMA when compared to
nconnect-enabled TCP on these tests. So yes, this particular test was
running with both RDMA and nconnect enabled.

However, given that the contended spinlock is not per connection, but
rather per pool, I don't see how nconnect matters much either, since
all it does is increase the number of connections. The other obvious
way to do that is to increase the number of NFS clients.

> > Otherwise there is a risk of introducing code in a fundamental part
> > of SunRPC that is optimized to the point of brittleness, and for no
> > good reason.
> > 
> > This is what keeps me from sleeping at night: [1]
> > See, it even has my name on it. :-)
> > 
> > --
> > Chuck Lever
> > 
> > [1] - https://www.linusakesson.net/programming/kernighans-lever/index.php
> 
> The conclusion of that article is that we SHOULD try to write clever
> code, because the effort invested in writing it and then debugging it
> makes us cleverer, so that the next time we can do even better. That
> thought would help me sleep at night!

I mostly sleep well. However, that doesn't stop me from thinking that
we have good reason to address bottlenecks when they are as obvious as
this one.
Again, I'm happy to see if I can get hold of some more hardware to test
the NUMA hypothesis. However, if that retest shows the contention is
still present, then we should consider the proposals that have been
made in this thread.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx