Re: knfsd performance

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 19 Jun 2024 12:56:44 +1000

On Tue, Jun 18, 2024 at 11:26:22PM +0000, Chuck Lever III wrote:
> > On Jun 18, 2024, at 7:17 PM, NeilBrown <neilb@xxxxxxx> wrote:
> >> Eventually we'd like to make the thread pools dynamic, at which point
> >> making that the default becomes much simpler from an administrative
> >> standpoint.
> > 
> > I agree that dynamic thread pools will make numa management simpler.
> > Greg Banks did the numa work for SGI - I wonder where he is now.  He was
> > at fastmail 10 years ago..
>
> Dave (cc'd) designed it with Greg, Greg implemented it.

[ I'll dump a bit of history about the NUMA nfsd architecture at the
end. ]

> > The idea was to bind network interfaces to numa nodes with interrupt
> > routing.  There was no expectation that work would be distributed evenly
> > across all nodes. Some might be dedicated to non-nfs work.  So there was
> > expected to be non-trivial configuration for both IRQ routing and
> > threads-per-node.  If we can make threads-per-node demand-based, then
> > half the problem goes away.

Right.

For the dynamic thread pool stuff, the grow side was a simple
heuristic: when we dequeued a request, we checked whether the request
queue was still non-empty, whether there were any idle nfsd threads
and whether we were under the max thread count.  i.e. if we had more
work to do and no idle workers to do it, we forked another nfsd
thread to do the work.
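
As a rough userspace sketch of that grow check (this is not the Irix
or Linux code; the struct, its fields and the function names are all
made up for illustration, and "forking a thread" is reduced to
bumping a counter):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-pool state; field names are illustrative only. */
struct nfsd_pool {
	size_t queued;		/* requests waiting in the queue */
	size_t idle;		/* threads currently waiting for work */
	size_t nthreads;	/* threads that exist in this pool */
	size_t max_threads;	/* administrative upper bound */
};

/* Evaluated just after a dequeue: more work, nobody idle, room to grow. */
static bool pool_should_grow(const struct nfsd_pool *p)
{
	assert(p->idle <= p->nthreads);
	return p->queued > 0 && p->idle == 0 && p->nthreads < p->max_threads;
}

/*
 * Take one request off the queue and grow the pool if the heuristic
 * says so. Returns true if a new thread was (notionally) started.
 */
static bool pool_dequeue(struct nfsd_pool *p)
{
	if (p->queued == 0)
		return false;	/* nothing to take */
	p->queued--;		/* the caller now owns one request */
	if (!pool_should_grow(p))
		return false;
	p->nthreads++;		/* stand-in for forking another nfsd */
	return true;
}
```

The check runs after the dequeue on purpose: the thread doing the
dequeue is already committed to one request, so the only question is
whether anything is left behind with nobody free to pick it up.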

I don't recall exactly what Greg implemented on Linux for the shrink
side. On Irix, each nfsd would record the time at which it completed
its last request, and we fired a timer every 30s or so to walk the
nfsd status array. If we found an nfsd with a
completion time older than 30s, the nfsd got reaped. 30s was long
enough to handle bursty loads, but short enough that people didn't
complain about having hundreds of nfsds sitting around....
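
A minimal sketch of that kind of idle reaper, again in plain
userspace C. The status slot layout and function names are
hypothetical, and the min_keep floor is my assumption (the
description above doesn't say whether a minimum thread count was
enforced):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define NFSD_IDLE_LIMIT	30	/* seconds; the "30s or so" above */

/* Hypothetical status-array slot; not the Irix or Linux layout. */
struct nfsd_status {
	bool	in_use;		/* slot holds a live thread */
	bool	busy;		/* currently processing a request */
	time_t	last_complete;	/* when it finished its last request */
};

/* Called by a worker when it finishes processing a request. */
static void nfsd_mark_complete(struct nfsd_status *s, time_t now)
{
	s->busy = false;
	s->last_complete = now;
}

/*
 * Periodic timer body: walk the status array and reap threads that
 * have been idle for longer than the limit, but never drop below
 * min_keep live threads (assumed floor). Returns the number reaped.
 */
static size_t nfsd_reap_idle(struct nfsd_status *tab, size_t n,
			     time_t now, size_t min_keep)
{
	size_t live = 0, reaped = 0;

	assert(tab != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		if (tab[i].in_use)
			live++;

	for (size_t i = 0; i < n && live > min_keep; i++) {
		struct nfsd_status *s = &tab[i];

		if (!s->in_use || s->busy)
			continue;
		if (difftime(now, s->last_complete) > NFSD_IDLE_LIMIT) {
			s->in_use = false;	/* stand-in for stopping the thread */
			live--;
			reaped++;
		}
	}
	return reaped;
}
```

With the timer period and the idle limit both around 30s, a thread
can sit idle for up to roughly a minute before it actually goes away,
which is part of why short bursts don't cause thread churn.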

This is basically a very simple version of what workqueues do for us
now.

That is, if we just make the nfsd request work be based on per-node, node
affine, unbound work queues, then thread scaling comes along for
free. I think that workqueues support this per-node thread pool
affinity natively now:

enum wq_affn_scope {
        WQ_AFFN_DFL,                    /* use system default */
        WQ_AFFN_CPU,                    /* one pod per CPU */
        WQ_AFFN_SMT,                    /* one pod per SMT */
        WQ_AFFN_CACHE,                  /* one pod per LLC */
>>>>>   WQ_AFFN_NUMA,                   /* one pod per NUMA node */
        WQ_AFFN_SYSTEM,                 /* one pod across the whole system */

        WQ_AFFN_NR_TYPES,
};

I'm not sure that the NFS server needs to reinvent the wheel here...

> Network devices (and storage devices) are affined to one
> NUMA node.

NVMe storage devices don't need to be affine to the node. They just
need to have a hardware queue assigned to each node so that
node-local IO always hits the same hardware queue and gets
completion interrupts returned to that same node.

And, yes, this is something that still has to be configured
manually, too.

> If the nfsd threads are not on the same node
> as the network device, there is a significant penalty.
>
> I have a two-node system here, and it performs consistently
> well when I put it in pool-mode=numa and affine the network
> device's IRQs to one node.
>
> It even works with two network devices (one per node) --
> each device gets its own set of nfsd threads.

Right. But this is all orthogonal to solving the problem of demand
based thread pool scaling.

> I don't think the pool_mode needs to be demand based. If
> the system is a NUMA system, it makes sense to split up
> the thread pools and put our pencils down. The only other
> step that is needed is proper IRQ affinity settings for
> the network devices.

I think it's better for everyone if the system automatically scales
with demand, regardless of whether it's a NUMA system or not, and
regardless of whether the proper NUMA affinity has been configured
or not.

> > We could even default to one-thread-pool-per-CPU if there are more than
> > X cpus....
> 
> I've never seen a performance improvement in the per-cpu
> pool mode, fwiw.

We're not doing anything inherently per-cpu in processing an NFS
request, so I can only see downsides to trying to restrict incoming
processing to per-cpu queues. i.e. if the CPU can't handle all the
incoming requests, what processes the per-cpu request backlog? At
least with per-node queues, we have many cpus to throw at the one
incoming request queue and we are much less likely to get backlogged
and starve the request queue of processing resources...

Cheers,

Dave.

-----

History....

I did the original NUMA NFS server architecture and implementation
work on Irix Origin NUMA machines. Greg took that architecture and
made it work on Linux Altix NUMA machines. The differences in OS
implementation and the general lack of NUMA and hardware topology
support in Linux meant that things had to be done differently on
Linux. The Linux project was a much bigger and more complex undertaking,
and Greg did most of that work.

The original Irix architecture was per-node NICs and FC HBAs.  The
per-node HBAs were attached to shared storage via a multi-path FC
SAN, and the NICs were bonded to the local switch and load balanced.
Every node had the same network and storage bandwidth - I think it
was 4x1GbE and 2x2Gb FC ports per 8p/32GB node.

IOWs, roughly 400MB/s per numa node to/from network, to/from disk.

On Irix, the NICs and storage hardware were always configured at
startup to be affine to the nearest CPU node. The OS knew the
physical topology of every piece of hardware in the system, and knew
what NUMA node they should be bound to. Nothing needed manual
configuration to be node affine.

Because we were dealing with networks of thousands of NFS clients
(think 2000s era renderwalls) we used ip/port hash based load
balancing across all the NIC ports in the machine. This effectively
always drove the requests from a single client on the network to a
specific NIC on the server. With the NICs being bound to a specific
node, we essentially sharded per-client information into per-node
structures (e.g. duplicate request detection). Hence most requests
rarely needed cross-node structure access. Incoming requests were
processed on the local node, and any IO that the local NFSd needed
to do through XFS was sent directly out the HBAs on the local node
and completions were signalled back to that node.
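
To make the sharding concrete, here is a small illustrative sketch
of an ip/port hash picking a NIC, with the NIC in turn fixing the
NUMA node. The constants match the 4 node, 16 port machine described
in this mail; the hash itself is an assumption (I'm using the
MurmurHash3 32-bit finaliser as a stand-in, not whatever the switch
or bonding layer actually computed), and the function names are made
up:

```c
#include <assert.h>
#include <stdint.h>

#define NICS_PER_NODE	4	/* 4x1GbE per node */
#define NR_NODES	4	/* 4 node machine */
#define NR_NICS		(NICS_PER_NODE * NR_NODES)

/* Stand-in mixer (MurmurHash3 finaliser); the real hash is not known. */
static uint32_t mix32(uint32_t h)
{
	h ^= h >> 16;
	h *= 0x85ebca6bu;
	h ^= h >> 13;
	h *= 0xc2b2ae35u;
	h ^= h >> 16;
	return h;
}

/* The same client ip/port always lands on the same NIC port. */
static unsigned int client_to_nic(uint32_t ip, uint16_t port)
{
	return mix32(ip ^ ((uint32_t)port << 16)) % NR_NICS;
}

/* NICs are bound to nodes in contiguous groups of NICS_PER_NODE. */
static unsigned int nic_to_node(unsigned int nic)
{
	assert(nic < NR_NICS);
	return nic / NICS_PER_NODE;
}

/* So the client's per-node state (e.g. its DRC shard) is fixed too. */
static unsigned int client_to_node(uint32_t ip, uint16_t port)
{
	return nic_to_node(client_to_nic(ip, port));
}
```

Because the mapping is a pure function of the client's address and
port, nothing needs to be looked up or shared across nodes to decide
where a client's requests and per-client state live.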

Essentially, all the requests, the processing and IO for a specific
client ended up being bound to a specific NUMA node. The largest
Irix machines we shipped as dedicated NAS boxes were 4 node O300
machines. They had 32 600MHz r18k MIPS CPUs, 192GB RAM, 16 GbE
ports, 8xFC HBA ports and they scaled out to a bit over 1.2GB/s
to/from the network to/from storage. That doesn't seem like much
these days, but this was back in 2005....

There was relatively little that needed to change in Irix to do
this. It largely had all the functionality it already needed, and so
it was largely a simple matter of connecting existing dots once the
overall architecture was worked out.

OTOH, at the time Linux didn't really have a concept of physical
hardware topology. All interrupt vectoring had to be set up manually
in userspace to point them at the right node. The network drivers
didn't use node affine memory to fill receive rings. The bonding
driver didn't support the whacky arp games the Irix driver
utilised to do its thing. The network stack performance was a long
way behind Irix. The nfsd didn't support dynamic worker thread
instantiation. The scheduler didn't really understand numa topology
or node affinity very well. The list went on.

These were some of the problems that Greg and other SGI people
addressed - that was *much* more work than what I had to do
originally to get this all to work on Irix, even though the overall
architecture was largely the same.

-- 
Dave Chinner
david@xxxxxxxxxxxxx