On Tue, Jun 18, 2024 at 11:26:22PM +0000, Chuck Lever III wrote: > > On Jun 18, 2024, at 7:17 PM, NeilBrown <neilb@xxxxxxx> wrote: > >> Eventually we'd like to make the thread poos dynamic, at which point > >> making that the default becomes much simpler from an administrative > >> standpoint. > > > > I agree that dynamic thread pools will make numa management simpler. > > Greg Banks did the numa work for SGI - I wonder where he is now. He was > > at fastmail 10 years ago.. > > Dave (cc'd) designed it with Greg, Greg implemented it. [ I'll dump a bit of history about the NUMA nfsd architecture at the end. ] > > The idea was to bind network interfaces to numa nodes with interrupt > > routing. There was no expectation that work would be distributed evenly > > across all nodes. Some might be dedicated to non-nfs work. So there was > > expected to be non-trivial configuration for both IRQ routing and > > threads-per-node. If we can make threads-per-node demand-based, then > > half the problem goes away. Right. For the dynamic thread pool stuff, the grow side was a simple heuristic: when we dequeued a request, we checked if the request queue was empty, if there were idle nfsd threads and whether we were under the max thread count. i.e. If we had more work to do and no idle workers to do it, we forked another nfsd thread to do the work. I don't recall exactly what Greg implemented on Linux for the shrink side. On Irix, the nfsd would record the time at which it completed it's last request processing, we fired a timer every 30s or so to walk the nfsd status array. If we found an nfsd with a completion time older than 30s, the nfsd got reaped. 30s was long enough to handle bursty loads, but short enough that people didn't complain about having hundreds of nfsds sitting around.... This is basically a very simple version of what workqueues do for us now. That is, if we just make the nfsd request work be based on per-node, node affine, unbound work queues, then thread scaling comes along for free. I think that workqueues support this per-node thread pool affinity natively now: enum wq_affn_scope { WQ_AFFN_DFL, /* use system default */ WQ_AFFN_CPU, /* one pod per CPU */ WQ_AFFN_SMT, /* one pod poer SMT */ WQ_AFFN_CACHE, /* one pod per LLC */ >>>>> WQ_AFFN_NUMA, /* one pod per NUMA node */ WQ_AFFN_SYSTEM, /* one pod across the whole system */ WQ_AFFN_NR_TYPES, }; I'm not sure that the NFS server needs to reinvent the wheel here... > Network devices (and storage devices) are affined to one > NUMA node. NVMe storage devices don't need to be affine to the node. They just need to have a hardware queue assigned to each node so that node-local IO always hits the same hardware queue and gets completion interrupts returned to that same node. And, yes, this is something that still has to be configured manually, too. > If the nfsd threads are not on the same node > as the network device, there is a significant penalty. > > I have a two-node system here, and it performs consistently > well when I put it in pool-mode=numa and affine the network > device's IRQs to one node. > > It even works with two network devices (one per node) -- > each device gets its own set of nfsd threads. Right. But this is all orthogonal to solving the problem of demand based thread pool scaling. > I don't think the pool_mode needs to be demand based. If > the system is a NUMA system, it makes sense to split up > the thread pools and put our pencils down. The only other > step that is needed is proper IRQ affinity settings for > the network devices. I think it's better for everyone if the system automatically scales with demand, regardless of whether it's a NUMA system or not, and regardless of whether the proper NUMA affinity has been configured or not. > > We could even default to one-thread-pool-per-CPU if there are more than > > X cpus.... > > I've never seen a performance improvement in the per-cpu > pool mode, fwiw. We're not doing anything inherently per-cpu in processing an NFS request, so I can only see downsides to trying to restrict incoming processing to per-cpu queues. i.e. if the CPU can't handle all the incoming requests, what processes the per-cpu request backlog? At least with per-node queues, we have many cpus to through at the one incoming request queue and we are much less likely to get backlogged and starve the request queue of processing resources... Cheers, Dave. ----- History.... I did the original NUMA NFS server architecture and implementation work on Irix Origin NUMA machines. Greg took that architecture and made it work on Linux Altix NUMA machines. The differences in OS implementation and the general lack of NUMA and hardware topology support in Linux meant that things had to be done differently on Linux. The Linux project was a much bigger and complex undertaking, and Greg did most of that work. The original Irix architecture was per-node NICs and FC HBAs. The per-node HBAs were attached to shared storage via a multi-path FC SAN, and the NICS were bonded to the local switch and load balanced. Every node had the same network and storage bandwdith - I think it was 4x1GbE and 2x2Gb FC ports per 8p/32GB node. IOWs, roughly 400MB/s per numa node to/from network, to/from disk. On Irix, the NICs and storage hardware were always configured at startup to be affine to the nearest CPU node. The OS knew the physical topology of every piece of hardware in the system, and knew what NUMA node they should be bound to. Nothing needed manual configuration to be node affine. Because we were dealing with networks of thousands of NFS clients (think 2000s era renderwalls) we used ip/port hash based load balancing across all the NIC ports in the machine. This effectively always drove the requests from a single client on the network to a specific NIC on the server. With the NICs being bound to a specific node, we essentially sharded per-client information into per-node structures (e.g. duplicate request detection). Hence most requests rarely need cross-node structure access. Incoming requests were processed on the local node, and any IO that the local NFSd needed to do through XFS was sent directly out the HBAs on the local node and completions were signalled back to that node. Essentially, all the requests, the processing and IO for a specific client ended up being bound to a specific NUMA node. The largest Irix machines we shipped as dedicated NAS boxes were 4 node O300 machines. They had 32 600Mhz r18k MIPS CPUs, 192GB RAM, 16 GbE ports, 8xFC HBA ports and they scaled out to a bit over 1.2GB/s to/from the network to/from storage. That doesn't seem like much these days, but this was back in 2005.... There was relatively little that needed to change in Irix to do this. It largely had all the functionality it already needed, and so it was largely a simple matter of connecting existing dots once the overall architecture was worked out. OTOH, at the time Linux didn't really have a concept of physical hardware topology. All interrupt vectoring had to be set up manually in userspace to point them at the right node. The network drivers didn't use node affine memory to fill receive rings. The bonding driver didn't support the whacky arp games the Irix driver utilised to do it's thing. The network stack performance was a long way behind Irix. The nfsd didn't support dynamic worker thread instantiation. The scheduler didn't really understand numa topology or node affinity very well. The list went on. These were some of the problems that Greg and other SGI people addressed - that was *much* more work that what I had to do originally to get this all to work on Irix, even though the overall architecture was largely the same. -- Dave Chinner david@xxxxxxxxxxxxx