Re: [PATCH v2 10/13] nfsd: move th_cnt into nfsd_net

Chuck Lever III <chuck.lever@xxxxxxxxxx> · Fri, 26 Jan 2024 15:16:14 +0000

> On Jan 26, 2024, at 10:03 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> 
> On Fri, 2024-01-26 at 14:27 +0000, Chuck Lever III wrote:
>> 
>>> On Jan 26, 2024, at 9:08 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>> 
>>> On Fri, 2024-01-26 at 13:48 +0000, Chuck Lever III wrote:
>>>> 
>>>>> On Jan 26, 2024, at 8:01 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>>>> 
>>>>> On Thu, 2024-01-25 at 16:56 -0500, Josef Bacik wrote:
>>>>>> On Thu, Jan 25, 2024 at 04:01:27PM -0500, Chuck Lever wrote:
>>>>>>> On Thu, Jan 25, 2024 at 02:53:20PM -0500, Josef Bacik wrote:
>>>>>>>> This is the last global stat, move it into nfsd_net and adjust all the
>>>>>>>> users to use that variant instead of the global one.
>>>>>>> 
>>>>>>> Hm. I thought nfsd threads were a global resource -- they service
>>>>>>> all network namespaces. So, shouldn't the same thread count be
>>>>>>> surfaced to all containers? Won't they all see all of the nfsd
>>>>>>> processes?
>>>>> 
>>>>> Each container is going to start /proc/fs/nfsd/threads number of threads
>>>>> regardless. I hadn't actually grokked that they just get tossed onto the
>>>>> pile of threads that service requests.
>>>>> 
>>>>> Is is possible for one container to start a small number of threads but
>>>>> have its client load be such that it spills over and ends up stealing
>>>>> threads from other containers?
>>>> 
>>>> I haven't seen any code that manages resources based on namespace,
>>>> except in filecache.c to restrict writeback per namespace.
>>>> 
>>>> My impression is that any nfsd thread can serve any namespace. I'm
>>>> not sure it is currently meaningful for a particular net namespace to
>>>> "create" more threads.
>>>> 
>>>> If someone would like that level of control, we could implement a
>>>> cgroup mechanism and have one or more separate svc_pools per net
>>>> namespace, maybe? </hand wave>
>>>> 
>>> 
>>> AFAICT, the total number of threads on the system will be the sum of the
>>> threads started in each of the containers. They do just go into a big
>>> pile, and whichever one wakes up will service the request, so the
>>> threads aren't associated with the netns, per-se. The svc_rqst's however
>>> _are_ per-netns. So, I don't see anything that ensures that a container
>>> doesn't exceed the number of threads it started on its own behalf.
>>> 
>>> <hand wave>
>>> I'm not sure we'd need to tie this in to cgroups.
>> 
>> Not a need, but cgroups are typically the way that such
>> resource constraints are managed, that's all.
>> 
>> 
>>> Now that Josef is
>>> moving some of these key structures to be per-net, it should be fairly
>>> simple to have nfsd() just look at the th_cnt and the thread count in
>>> the current namespace, and just enqueue the RPC rather than doing it?
>>> </hand wave>
>>> 
>>> OTOH, maybe I'm overly concerned here.
>>> 
>>> 
>>>> 
>>>>>> I don't think we want the network namespaces seeing how many threads exist in
>>>>>> the entire system right?
>>>> 
>>>> If someone in a non-init net namespace does a "pgrep -c nfsd" don't
>>>> they see the total nfsd thread count for the host?
>>>> 
>>> 
>>> Yes, they're kernel threads and they aren't associated with a particular
>>> pid namespace.
>>> 
>>>> 
>>>>>> Additionally it appears that we can have multiple threads per network namespace,
>>>>>> so it's not like this will just show 1 for each individual nn, it'll show
>>>>>> however many threads have been configured for that nfsd in that network
>>>>>> namespace.
>>>> 
>>>> I've never tried this, so I'm speculating. But it seems like for
>>>> now, because all nfsd threads can serve all namespaces, they should
>>>> all see the global thread count stat.
>>>> 
>>>> Then later we can refine it.
>>>> 
>>> 
>>> I don't think that info is particularly useful though, and it certainly
>>> breaks expectations wrt container isolation.
>>> 
>>> Look at it this way:
>>> 
>>> Suppose I have access to a container and I spin up nfsd with a
>>> particular number of threads. I now want to know "did I spin up enough
>>> threads?"
>> 
>> It makes sense to me for a container to ask for one or more
>> NFSD listeners. But I'm not yet clear on what it means for
>> a container to try to alter the NFSD thread count, which is
>> currently global.
>> 
> 
> write_threads sets the number of nfsd threads for the svc_serv, which is
> a per-net object, so serv->sv_nrthreads is only counting the threads for
> its namespace.

OK. I missed the part where struct svc_serv is per-net, and
each svc_serv has its own thread pool. Got it.

> Each container that starts an nfsd will contribute "threads" number of
> nfsds to the pile. When you set "threads" to a different number, the
> total pile is grown or reduced accordingly. In fact, I'm not even sure
> we keep a systemwide total.
> 
> What _isn't_ done (unless I'm missing something) is any sort of
> limitation on the use of those threads. So you could have a container
> that starts one nfsd thread, but its clients can still have many more
> RPCs in flight, up to the total number of nfsd's running on the host.
> Limiting that might be possible, but I'm not yet convinced that it's a
> good idea.
> 
> It might be interesting to gather some metrics though. How often do the
> number of RPCs in flight exceed the number of threads that we started in
> a namespace? Might also be good to gather a high-water mark too?

I had a bunch of trace points queued up for this purpose,
and all that got thrown out by Neil's recent work fixing
up our pool thread scheduler.

I'm not sure that queued requests are all that meaningful
or even quantifiable. A more meaningful metric is round
trip time measured on clients.

>>> By making this per-namespace as Josef suggests it should be
>>> fairly simple to tell whether my clients are regularly overrunning the
>>> threads I started. With this info as global, I have no idea what netns
>>> the RPCs being counted are against. I can't do anything with that info.
>> 
>> That's true. I'm just not sure we need to do anything
>> significant about it as part of this patch series.
>> 
>> I'm not at all advocating leaving it like this. I agree it
>> needs updating.
>> 
>> For now, just fill in that thread count stat with the global
>> thread count, and later when we have a better concept for
>> what it means to "adjust the nfsd thread count from inside a
>> container" we can come back and make it make sense.
> 
> It would be good to step back and decide how we think this should work.
> What would this look like if we were designing it today? Then we can
> decide how to get there.
> 
> Personally, I think keeping the view inside the container as isolated as
> possible is the right thing to do.

I agree. I would like to get Josef's patches in quickly too.

Should that metric report svc_serv::sv_nrthreads instead of
the global thread count?

--
Chuck Lever