> On Jun 18, 2024, at 10:56 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Tue, Jun 18, 2024 at 11:26:22PM +0000, Chuck Lever III wrote:
>>> On Jun 18, 2024, at 7:17 PM, NeilBrown <neilb@xxxxxxx> wrote:
>>>> Eventually we'd like to make the thread pools dynamic, at which
>>>> point making that the default becomes much simpler from an
>>>> administrative standpoint.
>>>
>>> I agree that dynamic thread pools will make numa management simpler.
>>> Greg Banks did the numa work for SGI - I wonder where he is now. He
>>> was at fastmail 10 years ago..
>>
>> Dave (cc'd) designed it with Greg, Greg implemented it.
>
> [ I'll dump a bit of history about the NUMA nfsd architecture at the
> end. ]
>
>>> The idea was to bind network interfaces to numa nodes with interrupt
>>> routing. There was no expectation that work would be distributed
>>> evenly across all nodes. Some might be dedicated to non-nfs work. So
>>> there was expected to be non-trivial configuration for both IRQ
>>> routing and threads-per-node. If we can make threads-per-node
>>> demand-based, then half the problem goes away.
>
> Right.
>
> For the dynamic thread pool stuff, the grow side was a simple
> heuristic: when we dequeued a request, we checked whether the request
> queue was empty, whether there were idle nfsd threads, and whether we
> were under the max thread count. i.e. if we had more work to do and
> no idle workers to do it, we forked another nfsd thread to do the
> work.
>
> I don't recall exactly what Greg implemented on Linux for the shrink
> side. On Irix, the nfsd would record the time at which it completed
> its last request processing, and we fired a timer every 30s or so to
> walk the nfsd status array. If we found an nfsd with a completion
> time older than 30s, the nfsd got reaped. 30s was long enough to
> handle bursty loads, but short enough that people didn't complain
> about having hundreds of nfsds sitting around....
>
> This is basically a very simple version of what workqueues do for us
> now.
>
> That is, if we just make the nfsd request work be based on per-node,
> node-affine, unbound work queues, then thread scaling comes along for
> free. I think that workqueues support this per-node thread pool
> affinity natively now:
>
> enum wq_affn_scope {
>         WQ_AFFN_DFL,            /* use system default */
>         WQ_AFFN_CPU,            /* one pod per CPU */
>         WQ_AFFN_SMT,            /* one pod per SMT */
>         WQ_AFFN_CACHE,          /* one pod per LLC */
>>>>>>    WQ_AFFN_NUMA,           /* one pod per NUMA node */
>         WQ_AFFN_SYSTEM,         /* one pod across the whole system */
>
>         WQ_AFFN_NR_TYPES,
> };

Note that wq_affn_scope appeared only in the last one or two kernel
releases. Previously there was no way to control workqueue cache
affinity.

I have observed significant improvement in throughput scalability on
the Linux NFS client with the new default workqueue cache affinity
behavior (with NFS/RDMA, of course). The issue is, interestingly
enough, the same as it was for lwq: spinlock contention is reduced
once the set of threads being scheduled all belong to the same
hardware cache.

> I'm not sure that the NFS server needs to reinvent the wheel here...

Jeff implemented a workqueue-based svc thread scheduler about 10 years
ago. We could never make it perform as well as the existing
kthreads-based scheduler. I'm not sure we need a work-conserving
thread scheduler for NFSD/SunRPC. That's why we took the route of
keeping the bespoke thread scheduler in SunRPC this time.
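For illustration, here is a minimal sketch of what Dave is describing
(a per-node, unbound request workqueue using the new affinity scope
interface). The names are hypothetical, this is not NFSD code, and it
assumes the workqueue_attrs helpers (alloc_workqueue_attrs(),
apply_workqueue_attrs()) are reachable from the caller; IIRC they are
not exported to modules today.

/* Illustrative sketch only -- not NFSD code. */
#include <linux/workqueue.h>

static struct workqueue_struct *nfsd_request_wq;    /* hypothetical */

static int nfsd_create_request_wq(void)
{
        struct workqueue_attrs *attrs;
        int ret;

        /* WQ_SYSFS also exposes an "affinity_scope" attribute for
         * runtime tuning. */
        nfsd_request_wq = alloc_workqueue("nfsd_request",
                                          WQ_UNBOUND | WQ_SYSFS, 0);
        if (!nfsd_request_wq)
                return -ENOMEM;

        attrs = alloc_workqueue_attrs();
        if (!attrs) {
                ret = -ENOMEM;
                goto out_destroy;
        }

        /* Group worker pools per NUMA node instead of the default
         * (per-LLC) scope. */
        attrs->affn_scope = WQ_AFFN_NUMA;
        ret = apply_workqueue_attrs(nfsd_request_wq, attrs);
        free_workqueue_attrs(attrs);
        if (ret)
                goto out_destroy;

        /* Work items queued with queue_work() now run on workers
         * affine to the NUMA node of the queueing CPU. */
        return 0;

out_destroy:
        destroy_workqueue(nfsd_request_wq);
        return ret;
}

The same selection is possible without any code, via the
workqueue.default_affinity_scope boot parameter or, for WQ_SYSFS
workqueues, the per-workqueue affinity_scope attribute under
/sys/devices/virtual/workqueue/.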
>> Network devices (and storage devices) are affined to one
>> NUMA node.
>
> NVMe storage devices don't need to be affine to the node. They just
> need to have a hardware queue assigned to each node so that
> node-local IO always hits the same hardware queue and gets completion
> interrupts returned to that same node.
>
> And, yes, this is something that still has to be configured manually,
> too.
>
>> If the nfsd threads are not on the same node as the network device,
>> there is a significant penalty.
>>
>> I have a two-node system here, and it performs consistently well
>> when I put it in pool-mode=numa and affine the network device's IRQs
>> to one node.
>>
>> It even works with two network devices (one per node) -- each device
>> gets its own set of nfsd threads.
>
> Right. But this is all orthogonal to solving the problem of demand
> based thread pool scaling.

True, but it /is/ related to Trond's initial observation. The network
devices need to be affined properly before pool_mode=numa is
completely effective.

>> I don't think the pool_mode needs to be demand based. If the system
>> is a NUMA system, it makes sense to split up the thread pools and
>> put our pencils down. The only other step that is needed is proper
>> IRQ affinity settings for the network devices.
>
> I think it's better for everyone if the system automatically scales
> with demand, regardless of whether it's a NUMA system or not, and
> regardless of whether the proper NUMA affinity has been configured or
> not.

Read the mistakes that I wrote here ;-) I wasn't referring to thread
count. As I responded later, I think automatically managing the pool
thread count would be a good improvement. We've been discussing this
off and on for a while.

>>> We could even default to one-thread-pool-per-CPU if there are more
>>> than X cpus....
>>
>> I've never seen a performance improvement in the per-cpu pool mode,
>> fwiw.
>
> We're not doing anything inherently per-cpu in processing an NFS
> request, so I can only see downsides to trying to restrict incoming
> processing to per-cpu queues. i.e. if the CPU can't handle all the
> incoming requests, what processes the per-cpu request backlog? At
> least with per-node queues, we have many cpus to throw at the one
> incoming request queue and we are much less likely to get backlogged
> and starve the request queue of processing resources...

I've proposed getting rid of pool_mode=cpu, because I can't find a
winning use case for it. We erred on the conservative side and left it
in place for now.

In terms of making pool_mode=numa the default setting, that will work
in many cases, but in some cases it will result in unpleasant
surprises.

The default out-of-the-shrink-wrap thread count is currently fixed at
8. In pool_mode=global, all 8 threads get used. In pool_mode=numa, the
total thread count is split across the nodes.

On a single-node system, there will be no change. All is well.

On a two-node system, though, each pool will get 4 threads, or only 2
threads each on a four-node system. That can have a surprising and
undesirable performance impact. I don't know what will happen on a
system with more nodes than nfsd threads.

Thus I think there needs to be a solution (like dynamic thread count
adjustment) in place to address thread count splitting before we can
change the default pool mode.

--
Chuck Lever