Re: knfsd performance

"NeilBrown" <neilb@xxxxxxx> · Wed, 19 Jun 2024 09:17:29 +1000

On Wed, 19 Jun 2024, Jeff Layton wrote:
> On Tue, 2024-06-18 at 19:54 +0000, Chuck Lever III wrote:
> > 
> > 
> > > On Jun 18, 2024, at 3:50 PM, Trond Myklebust
> > > <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > 
> > > On Tue, 2024-06-18 at 19:39 +0000, Chuck Lever III wrote:
> > > > 
> > > > 
> > > > > On Jun 18, 2024, at 3:29 PM, Trond Myklebust
> > > > > <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > 
> > > > > On Tue, 2024-06-18 at 18:40 +0000, Chuck Lever III wrote:
> > > > > > 
> > > > > > 
> > > > > > > On Jun 18, 2024, at 2:32 PM, Trond Myklebust
> > > > > > > <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > > > 
> > > > > > > > I recently back ported Neil's lwq code and sunrpc server
> > > > > > > > changes to our 5.15.130 based kernel in the hope of
> > > > > > > > improving the performance for our data servers.
> > > > > > > > 
> > > > > > > > Our performance team recently ran a fio workload on a
> > > > > > > > client that was doing 100% NFSv3 reads in O_DIRECT mode
> > > > > > > > over an RDMA connection (infiniband) against that
> > > > > > > > resulting server. I've attached the resulting flame graph
> > > > > > > > from a perf profile run on the server side.
> > > > > > > > 
> > > > > > > > Is anyone else seeing this massive contention for the spin
> > > > > > > > lock in __lwq_dequeue? As you can see, it appears to be
> > > > > > > > dwarfing all the other nfsd activity on the system in
> > > > > > > > question here, being responsible for 45% of all the perf
> > > > > > > > hits.
> > > > > > 
> > > > > > I haven't seen that, but I've been working on other issues.
> > > > > > 
> > > > > > What's the nfsd thread count on your test server? Have you
> > > > > > > seen a similar impact on 6.10 kernels?
> > > > > > 
> > > > > 
> > > > > > 640 knfsd threads. The machine was a supermicro 2029BT-HNR
> > > > > > with 2xIntel 6150, 384GB of memory and 6xWDC SN840.
> > > > > > 
> > > > > > Unfortunately, the machine was a loaner, so cannot compare
> > > > > > to 6.10. That's why I was asking if anyone has seen
> > > > > > anything similar.
> > > > 
> > > > If this system had more than one NUMA node, then using
> > > > svc's "numa pool" mode might have helped.
> > > > 
> > > 
> > > Interesting. I had forgotten about that setting.
> > > 
> > > Just out of curiosity, is there any reason why we might not want to
> > > default to that mode on a NUMA enabled system?
> > 
> > Can't think of one off hand. Maybe back in the day it was
> > hard to tell when you were actually /on/ a NUMA system.
> > 
> > Copying Dave to see if he has any recollection.
> > 
> 
> It's at least partly because of the klunkiness of the old pool_threads
> interface: You have to bring up the server first using the "threads"
> procfile, and then you can actually bring up threads in the various
> pools using pool_threads.
> 
> Same for shutdown. You have to bring down the pool_threads first and
> then you can bring down the final thread and the rest of the server
> with it. Why it was designed this way, I have NFC.
> 
> The new nfsdctl tool and netlink interfaces should make this simpler in
> the future. You'll be able to set the pool-mode in /etc/nfs.conf and
> configure a list of per-pool thread counts in there too. Once we have
> that, I think we'll be in a better position to consider doing it by
> default.
> 
> Eventually we'd like to make the thread pools dynamic, at which point
> making that the default becomes much simpler from an administrative
> standpoint.

I agree that dynamic thread pools will make numa management simpler.
Greg Banks did the numa work for SGI - I wonder where he is now.  He was
at Fastmail 10 years ago.

The idea was to bind network interfaces to numa nodes with interrupt
routing.  There was no expectation that work would be distributed evenly
across all nodes. Some might be dedicated to non-nfs work.  So there was
expected to be non-trivial configuration for both IRQ routing and
threads-per-node.  If we can make threads-per-node demand-based, then
half the problem goes away.
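
For reference, the ordering constraint Jeff describes for the old
pool_threads interface can be sketched roughly as below. This is only an
illustration, not how any existing tool does it: the helper names are
hypothetical, the functions just return the writes in order rather than
performing them, and it assumes the pool mode has to be chosen before
the server starts and that pool 0 holds the final thread at shutdown.

```python
# The three paths are the sunrpc/nfsd control files; the helper names
# (bringup_steps, shutdown_steps) are hypothetical, for illustration only.
POOL_MODE = "/sys/module/sunrpc/parameters/pool_mode"
THREADS = "/proc/fs/nfsd/threads"
POOL_THREADS = "/proc/fs/nfsd/pool_threads"


def bringup_steps(per_pool, pool_mode="pernode"):
    """Return the ordered (path, value) writes to start nfsd with per-pool counts."""
    if not per_pool or any(n < 0 for n in per_pool) or sum(per_pool) < 1:
        raise ValueError("need a non-empty list of counts with at least one thread")
    return [
        # Assumption: the pool mode is chosen while the server is still down.
        (POOL_MODE, pool_mode),
        # Bring the server up first through the "threads" file ...
        (THREADS, "1"),
        # ... and only then spread threads across the pools.
        (POOL_THREADS, " ".join(str(n) for n in per_pool)),
    ]


def shutdown_steps(num_pools):
    """Return the ordered (path, value) writes to stop nfsd again."""
    if num_pools < 1:
        raise ValueError("need at least one pool")
    # Assumption: pool 0 keeps the final thread until "threads" is set to 0.
    drained = ["1"] + ["0"] * (num_pools - 1)
    return [
        # Bring the per-pool counts down first ...
        (POOL_THREADS, " ".join(drained)),
        # ... then take down the final thread and the server with it.
        (THREADS, "0"),
    ]
```

On a two-node box, bringup_steps([320, 320]) would yield the pool_mode
write, then "1" to threads, then "320 320" to pool_threads. The point is
only that two files have to be driven in a fixed order, which is the
part a demand-based scheme would remove.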

We could even default to one-thread-pool-per-CPU if there are more than
X cpus....
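
As a rough sketch of what such a default might look like (purely
illustrative: the function name is hypothetical, cpu_threshold stands in
for the unspecified "X", and this is not a description of what the
kernel's existing "auto" mode does):

```python
def default_pool_mode(num_cpus, num_nodes, cpu_threshold):
    """Pick a sunrpc pool mode name from the machine's CPU and node counts.

    cpu_threshold is the "X" above; no particular value is suggested here.
    """
    if num_cpus < 1 or num_nodes < 1:
        raise ValueError("need at least one CPU and one NUMA node")
    if num_cpus > cpu_threshold:
        # Many CPUs: one thread pool per CPU.
        return "percpu"
    if num_nodes > 1:
        # NUMA machine below the threshold: one pool per node.
        return "pernode"
    # Small single-node machine: a single global pool.
    return "global"
```

The returned strings are the names the pool_mode parameter accepts;
everything else here, including the order of the two checks, is an
assumption.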

NeilBrown