> On Jul 16, 2024, at 2:49 PM, Tom Talpey <tom@xxxxxxxxxx> wrote:
>
> On 7/16/2024 9:31 AM, Chuck Lever III wrote:
>>> On Jul 16, 2024, at 7:00 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>>
>>> On Tue, 2024-07-16 at 13:21 +1000, NeilBrown wrote:
>>>> On Tue, 16 Jul 2024, Jeff Layton wrote:
>>>>> On Mon, 2024-07-15 at 17:14 +1000, NeilBrown wrote:
>>>>>> A future patch will allow the number of threads in each nfsd pool to
>>>>>> vary dynamically.
>>>>>> The lower bound will be the number explicitly requested via
>>>>>> /proc/fs/nfsd/threads or /proc/fs/nfsd/pool_threads.
>>>>>>
>>>>>> The upper bound can be set in each net namespace by writing
>>>>>> /proc/fs/nfsd/max_threads. This upper bound applies across all pools;
>>>>>> there is no per-pool upper limit.
>>>>>>
>>>>>> If no upper bound is set, then one is calculated. A global upper limit
>>>>>> is chosen based on the amount of memory. This limit only affects dynamic
>>>>>> changes. Static configuration can always override it.
>>>>>>
>>>>>> We track how many threads are configured in each net namespace, with the
>>>>>> max or the min. We also track how many net namespaces have nfsd
>>>>>> configured with only a min, not a max.
>>>>>>
>>>>>> The difference between the calculated max and the total allocation is
>>>>>> available to be shared among those namespaces which don't have a maximum
>>>>>> configured. Within a namespace, the available share is distributed
>>>>>> equally across all pools.
>>>>>>
>>>>>> In the common case there is one namespace and one pool. A small number
>>>>>> of threads are configured as a minimum and no maximum is set. In this
>>>>>> case the effective maximum will be based directly on total memory,
>>>>>> approximately 8 threads per gigabyte.
>>>>>>
>>>>>
>>>>>
>>>>> Some of this may come across as bikeshedding, but I'd probably prefer
>>>>> that this work a bit differently:
>>>>>
>>>>> 1/ I don't think we should enable this universally -- at least not
>>>>> initially. What I'd prefer to see is a new pool_mode for the dynamic
>>>>> threadpools (maybe call it "dynamic"). That gives us a clear opt-in
>>>>> mechanism. Later, once we're convinced it's safe, we can make "dynamic"
>>>>> the default instead of "global".
>>>>>
>>>>> 2/ Rather than specifying a max_threads value separately, why not allow
>>>>> the old threads/pool_threads interface to set the max and just have a
>>>>> reasonable minimum setting (like the current default of 8)? Since we're
>>>>> growing the threadpool dynamically, I don't see why we need to have a
>>>>> real configurable minimum.
>>>>>
>>>>> 3/ The dynamic pool_mode should probably be layered on top of the
>>>>> pernode pool mode. IOW, in a NUMA configuration, we should split the
>>>>> threads across NUMA nodes.
>>>>
>>>> Maybe we should start by discussing the goal. What do we want
>>>> configuration to look like when we finish?
>>>>
>>>> I think we want it to be transparent. Sysadmin does nothing, and it all
>>>> works perfectly. Or as close to that as we can get.
>>>>
>>>
>>> That's a nice eventual goal, but what do we do if we make this change
>>> and it's not behaving for them? We need some way for them to revert to
>>> traditional behavior if the new mode isn't working well.
>> As Steve pointed out (privately), there are likely to be cases
>> where the dynamic thread count adjustment creates too many
>> threads or somehow triggers a DoS. Admins want the ability to
>> disable new features that cause trouble, and it is impossible
>> for us to say truthfully that we have predicted every
>> misbehavior.
>> So +1 for having a mechanism for getting back the traditional
>> behavior, at least until we have confidence it is not going
>> to have troubling side effects.
>
> +1 on a configurable maximum as well, but I'll add a concern about
> the NUMA node thing.
>
> Not all CPU cores are created equal any more; there are "performance"
> and "efficiency" (Atom) cores, and there can be a big difference. Also,
> there are NUMA nodes with no CPUs at all, memory-only for example.
> Then, CXL scrambles the topology again.

I think it wouldn't be difficult to make the svc_pool_map skip creating
svc thread pools on NUMA nodes with no CPUs. And perhaps the min/max
settings need to be per pool? But the idea with dynamic thread pool
sizing is that if a pool (or node) is not getting NFS traffic, then its
thread pool will not grow.

> Let's not forget that these nfsd threads call into the filesystems,
> which may desire very different NUMA affinities; for example, the nfsd
> protocol side may prefer to be near the network adapter, while the
> filesystem side prefers to be near the storage. And RDMA can bypass
> memory copy costs.

Agreed, these issues still require administrator attention when
configuring a high-performance system.

> Thread count only addresses a fraction of these.
>
>> Yes, in a perfect world, fully autonomous thread count
>> adjustment would be amazing. Let's aim for that, but take
>> baby steps to get there.
>
> Amazing indeed, and just as unlikely to be perfect. Caution is good.
>
> Tom.

--
Chuck Lever
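
[Editor's note: for readers following the svc_pool_map point above, here is a
minimal sketch of the kind of check being described: counting only online NUMA
nodes that actually have CPUs when sizing a per-node pool map. The function
name and placement are illustrative assumptions, not the existing
net/sunrpc/svc.c code; the node-state helpers used are standard kernel APIs.]

	/*
	 * Illustrative sketch only -- not the current net/sunrpc/svc.c code.
	 * Count online NUMA nodes that have CPUs attached, so a per-node
	 * pool map would not create an svc thread pool for memory-only
	 * (or CXL-expanded) nodes.
	 */
	#include <linux/nodemask.h>

	static unsigned int svc_pool_count_cpu_nodes(void)
	{
		unsigned int nodes = 0;
		int nid;

		for_each_online_node(nid) {
			/* N_CPU is set only for nodes with at least one CPU */
			if (node_state(nid, N_CPU))
				nodes++;
		}

		/* Degenerate topology: fall back to a single global pool */
		return nodes ? nodes : 1;
	}

A count like this would let a pernode (or future dynamic) pool map index pools
by CPU-bearing nodes only, which also fits the observation above that a pool
receiving no NFS traffic would never grow anyway.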