> On Jul 16, 2024, at 2:49 PM, Tom Talpey <tom@xxxxxxxxxx> wrote:
>
> On 7/16/2024 9:31 AM, Chuck Lever III wrote:
>>> On Jul 16, 2024, at 7:00 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>>
>>> On Tue, 2024-07-16 at 13:21 +1000, NeilBrown wrote:
>>>> On Tue, 16 Jul 2024, Jeff Layton wrote:
>>>>> On Mon, 2024-07-15 at 17:14 +1000, NeilBrown wrote:
>>>>>> A future patch will allow the number of threads in each nfsd pool to
>>>>>> vary dynamically.
>>>>>> The lower bound will be the number explicitly requested via
>>>>>> /proc/fs/nfsd/threads or /proc/fs/nfsd/pool_threads.
>>>>>>
>>>>>> The upper bound can be set in each net namespace by writing
>>>>>> /proc/fs/nfsd/max_threads. This upper bound applies across all pools;
>>>>>> there is no per-pool upper limit.
>>>>>>
>>>>>> If no upper bound is set, then one is calculated. A global upper limit
>>>>>> is chosen based on the amount of memory. This limit only affects dynamic
>>>>>> changes. Static configuration can always override it.
>>>>>>
>>>>>> We track how many threads are configured in each net namespace, with the
>>>>>> max or the min. We also track how many net namespaces have nfsd
>>>>>> configured with only a min, not a max.
>>>>>>
>>>>>> The difference between the calculated max and the total allocation is
>>>>>> available to be shared among those namespaces which don't have a maximum
>>>>>> configured. Within a namespace, the available share is distributed
>>>>>> equally across all pools.
>>>>>>
>>>>>> In the common case there is one namespace and one pool. A small number
>>>>>> of threads are configured as a minimum and no maximum is set. In this
>>>>>> case the effective maximum will be based directly on total memory,
>>>>>> approximately 8 threads per gigabyte.
>>>>>>
>>>>>
>>>>>
>>>>> Some of this may come across as bikeshedding, but I'd probably prefer
>>>>> that this work a bit differently:
>>>>>
>>>>> 1/ I don't think we should enable this universally -- at least not
>>>>> initially. What I'd prefer to see is a new pool_mode for the dynamic
>>>>> threadpools (maybe call it "dynamic"). That gives us a clear opt-in
>>>>> mechanism. Later, once we're convinced it's safe, we can make "dynamic"
>>>>> the default instead of "global".
>>>>>
>>>>> 2/ Rather than specifying a max_threads value separately, why not allow
>>>>> the old threads/pool_threads interface to set the max and just have a
>>>>> reasonable minimum setting (like the current default of 8)? Since we're
>>>>> growing the threadpool dynamically, I don't see why we need to have a
>>>>> real configurable minimum.
>>>>>
>>>>> 3/ The dynamic pool_mode should probably be layered on top of the
>>>>> pernode pool mode. IOW, in a NUMA configuration, we should split the
>>>>> threads across NUMA nodes.
>>>>
>>>> Maybe we should start by discussing the goal. What do we want
>>>> configuration to look like when we finish?
>>>>
>>>> I think we want it to be transparent. Sysadmin does nothing, and it all
>>>> works perfectly. Or as close to that as we can get.
>>>>
>>>
>>> That's a nice eventual goal, but what do we do if we make this change
>>> and it's not behaving for them? We need some way for them to revert to
>>> traditional behavior if the new mode isn't working well.
>> As Steve pointed out (privately), there are likely to be cases
>> where the dynamic thread count adjustment creates too many
>> threads or somehow triggers a DoS. Admins want the ability to
>> disable new features that cause trouble, and it is impossible
>> for us to say truthfully that we have predicted every
>> misbehavior.
>> So +1 for having a mechanism for getting back the traditional
>> behavior, at least until we have confidence it is not going
>> to have troubling side effects.
>
> +1 on a configurable maximum as well, but I'll add a concern about
> the NUMA node thing.
>
> Not all CPU cores are created equal any more; there are "performance"
> and "efficiency" (Atom) cores, and there can be a big difference. Also,
> there are NUMA nodes with no CPUs at all, memory-only for example.
> Then, CXL scrambles the topology again.

I think it wouldn't be difficult to make the svc_pool_map skip creating
svc thread pools on NUMA nodes with no CPUs. And perhaps the min/max
settings need to be per pool? But the idea with dynamic thread pool
sizing is that if a pool (or node) is not getting NFS traffic, then its
thread pool will not grow.

> Let's not forget that these nfsd threads call into the filesystems,
> which may desire very different NUMA affinities; for example, the nfsd
> protocol side may prefer to be near the network adapter, while the
> filesystem side prefers to be near the storage. And RDMA can bypass
> memory copy costs.

Agreed, these issues still require administrator attention when
configuring a high-performance system.

> Thread count only addresses a fraction of these.
>
>> Yes, in a perfect world, fully autonomous thread count
>> adjustment would be amazing. Let's aim for that, but take
>> baby steps to get there.
>
> Amazing indeed, and just as unlikely to be perfect. Caution is good.
>
> Tom.

--
Chuck Lever
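
[Editor's note: for readers following the svc_pool_map point above, here is a
minimal sketch of the kind of check being described: counting only online NUMA
nodes that actually have CPUs when sizing a per-node pool map. The function
name and placement are illustrative assumptions, not the existing
net/sunrpc/svc.c code; the node-state helpers used are standard kernel APIs.]

	/*
	 * Illustrative sketch only -- not the current net/sunrpc/svc.c code.
	 * Count online NUMA nodes that have CPUs attached, so a per-node
	 * pool map would not create an svc thread pool for memory-only
	 * (or CXL-expanded) nodes.
	 */
	#include <linux/nodemask.h>

	static unsigned int svc_pool_count_cpu_nodes(void)
	{
		unsigned int nodes = 0;
		int nid;

		for_each_online_node(nid) {
			/* N_CPU is set only for nodes with at least one CPU */
			if (node_state(nid, N_CPU))
				nodes++;
		}

		/* Degenerate topology: fall back to a single global pool */
		return nodes ? nodes : 1;
	}

A count like this would let a pernode (or future dynamic) pool map index pools
by CPU-bearing nodes only, which also fits the observation above that a pool
receiving no NFS traffic would never grow anyway.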