On Tue, Jun 18, 2024 at 07:54:43PM +0000, Chuck Lever III wrote:
> On Jun 18, 2024, at 3:50 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > 
> > On Tue, 2024-06-18 at 19:39 +0000, Chuck Lever III wrote:
> >> 
> >>> On Jun 18, 2024, at 3:29 PM, Trond Myklebust
> >>> <trondmy@xxxxxxxxxxxxxxx> wrote:
> >>> 
> >>> On Tue, 2024-06-18 at 18:40 +0000, Chuck Lever III wrote:
> >>>> 
> >>>>> On Jun 18, 2024, at 2:32 PM, Trond Myklebust
> >>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
> >>>>> 
> >>>>> I recently back ported Neil's lwq code and sunrpc server
> >>>>> changes to our 5.15.130 based kernel in the hope of improving
> >>>>> the performance for our data servers.
> >>>>> 
> >>>>> Our performance team recently ran a fio workload on a client
> >>>>> that was doing 100% NFSv3 reads in O_DIRECT mode over an RDMA
> >>>>> connection (infiniband) against that resulting server. I've
> >>>>> attached the resulting flame graph from a perf profile run on
> >>>>> the server side.
> >>>>> 
> >>>>> Is anyone else seeing this massive contention for the spin
> >>>>> lock in __lwq_dequeue? As you can see, it appears to be
> >>>>> dwarfing all the other nfsd activity on the system in
> >>>>> question here, being responsible for 45% of all the perf
> >>>>> hits.

Ouch. __lwq_dequeue() runs llist_reverse_order() under a spinlock.
llist_reverse_order() is an O(n) algorithm involving a full-length
linked list traversal. IOWs, it's a worst-case cache miss algorithm
running under a spin lock.

And then consider what happens when enqueue processing is faster
than dequeue processing. This means the depth of the queue grows,
and the ultimate length of the queue is unbounded. Because of the
batch processing nature of lwq - it takes ->new, reverses it and
places it in ->ready - the length of the list that needs reversing
ends up growing with every batch where we enqueue faster than we
dequeue.

Unbounded processing queues are bad even when they have O(1)
behaviour. lwq has O(n) worst case behaviour, and that makes this
even worse...

Regardless, the current lwq could be slightly improved - the
lockless enqueue competes for the same cacheline as the dequeue
serialisation lock:

struct lwq {
	spinlock_t		lock;
	struct llist_node	*ready;		/* entries to be dequeued */
	struct llist_head	new;		/* entries being enqueued */
};

Adding __cacheline_aligned_in_smp to ->new (the enqueue side) might
help reduce this enqueue/dequeue cacheline contention a bit by
separating them onto different cachelines (see the sketch further
below). That will push the point of catastrophic breakdown out a
little bit, but it won't solve the issue of queue-depth-based batch
processing on the dequeue side.

I suspect a lockless ring buffer might be a more scalable solution
for the nfsd...

> >>>> I haven't seen that, but I've been working on other issues.
> >>>> 
> >>>> What's the nfsd thread count on your test server? Have you
> >>>> seen a similar impact on 6.10 kernels?
> >>>> 
> >>> 
> >>> 640 knfsd threads. The machine was a supermicro 2029BT-HNR with
> >>> 2xIntel 6150, 384GB of memory and 6xWDC SN840.
> >>> 
> >>> Unfortunately, the machine was a loaner, so cannot compare to
> >>> 6.10. That's why I was asking if anyone has seen anything
> >>> similar.
> >> 
> >> If this system had more than one NUMA node, then using
> >> svc's "numa pool" mode might have helped.

It's a dual socket machine, so it has at least 2 physical nodes.
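
Coming back to the struct layout point above: a minimal, untested
sketch of what pushing the enqueue side onto its own cacheline might
look like. The name lwq_padded is invented for the illustration, and
the member-level ____cacheline_aligned_in_smp annotation from
<linux/cache.h> is used because that is the variant that applies to
an individual struct field:

#include <linux/cache.h>	/* ____cacheline_aligned_in_smp */
#include <linux/llist.h>
#include <linux/spinlock.h>

/*
 * Sketch only: same fields as the current lwq, but with the lockless
 * enqueue side (->new) placed on its own cacheline so that producers
 * no longer bounce the line holding the dequeue serialisation lock
 * and ->ready.  This only reduces enqueue/dequeue contention; it does
 * nothing about the O(n) llist_reverse_order() batch under the lock.
 */
struct lwq_padded {
	spinlock_t		lock;
	struct llist_node	*ready;	/* entries to be dequeued */

	/* enqueue side on its own cacheline */
	struct llist_head	new ____cacheline_aligned_in_smp;
};

On a !CONFIG_SMP build the annotation compiles away to nothing, so
the layout only changes where the contention can actually occur.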

Of course, the bios has to be configured to expose this as a NUMA
machine and not a "legacy SMP" machine for the OS to know that, but
I'm pretty sure that's been the typical server bios default for
quite a few years now.

Even if it wasn't a dual socket machine, the CPU itself is a NUMA
architecture - just about every server x86-64 CPU sold these days is
a NUMA SOC, and neither core-to-core nor core-to-memory latency is
uniform within a socket any more. Desktop CPUs are well down this
track, too.

Intel exposes the details of the topology within a socket via the
bios option known as "sub-numa clustering". This exposes the full
sub-socket CPU and memory topology to the OS, so it is fully aware
of both the on-chip and off-chip topology.

Using sub-numa clustering means we don't end up with 32+ CPUs in a
single per-socket NUMA node. NUMA scalability algorithms largely
rely on keeping the cores-per-numa-node ratio in check, and sub-numa
clustering enables the OS to keep this ratio down to reasonable
levels.

> > Interesting. I had forgotten about that setting.
> > 
> > Just out of curiosity, is there any reason why we might not want
> > to default to that mode on a NUMA-enabled system?
> 
> Can't think of one offhand. Maybe back in the day it was
> hard to tell when you were actually /on/ a NUMA system.

As per above, I would assume that the kernel is *always* running on
a NUMA machine. IOWs, if CONFIG_NUMA is enabled (which it is on just
about every distro kernel these days), then we should be using NUMA
optimisations by default.

If the machine is not a NUMA machine (e.g. single socket, sub-numa
clustering off), then the NUMA subsystem will be initialised with
nr_online_nodes = 1 (i.e. a single active node) and NUMA-aware
algorithms should just behave as if there is a single global node.
If CONFIG_NUMA=n, then nr_online_nodes is hard-coded to 1.

Hence subsystems only need to implement a single algorithm that is
NUMA aware. The bios/system config will tell the kernel how many
nodes there are, and the NUMA algorithms will just do the right
thing because nr_online_nodes=1 in those non-NUMA situations.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
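
To illustrate the nr_online_nodes point above, a minimal sketch (not
code from any of the kernels discussed in this thread) of the
"single NUMA-aware algorithm" setup pattern. The foo_node structure
and foo_init() are invented names; the array is sized by nr_node_ids
because node numbering can be sparse, and memory hotplug and error
unwinding are ignored. On a single-node machine, or with
CONFIG_NUMA=n, the loop simply sets up one node's worth of state, so
no separate non-NUMA code path is needed:

#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/init.h>
#include <linux/list.h>
#include <linux/nodemask.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

/* Invented per-node state for the example. */
struct foo_node {
	spinlock_t		lock;
	struct list_head	work;
};

static struct foo_node **foo_nodes;

/*
 * One NUMA-aware initialisation path.  When nr_online_nodes is 1
 * (single node, or CONFIG_NUMA=n) this degenerates to a single
 * global instance without any special casing.
 */
static int __init foo_init(void)
{
	int nid;

	foo_nodes = kcalloc(nr_node_ids, sizeof(*foo_nodes), GFP_KERNEL);
	if (!foo_nodes)
		return -ENOMEM;

	for_each_online_node(nid) {
		/* allocate each node's state from that node's memory */
		struct foo_node *fn = kzalloc_node(sizeof(*fn),
						   GFP_KERNEL, nid);

		if (!fn)
			return -ENOMEM;	/* unwinding omitted in this sketch */
		spin_lock_init(&fn->lock);
		INIT_LIST_HEAD(&fn->work);
		foo_nodes[nid] = fn;
	}
	return 0;
}

Lookups would then index foo_nodes[] by something like
numa_node_id(), so single-node and multi-node machines go through
exactly the same code path.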