On Fri, 2024-01-26 at 15:16 +0000, Chuck Lever III wrote:
> 
> > On Jan 26, 2024, at 10:03 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > 
> > On Fri, 2024-01-26 at 14:27 +0000, Chuck Lever III wrote:
> > > 
> > > > On Jan 26, 2024, at 9:08 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > 
> > > > On Fri, 2024-01-26 at 13:48 +0000, Chuck Lever III wrote:
> > > > > 
> > > > > > On Jan 26, 2024, at 8:01 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > > > 
> > > > > > On Thu, 2024-01-25 at 16:56 -0500, Josef Bacik wrote:
> > > > > > > On Thu, Jan 25, 2024 at 04:01:27PM -0500, Chuck Lever wrote:
> > > > > > > > On Thu, Jan 25, 2024 at 02:53:20PM -0500, Josef Bacik wrote:
> > > > > > > > > This is the last global stat, move it into nfsd_net and adjust all the
> > > > > > > > > users to use that variant instead of the global one.
> > > > > > > > 
> > > > > > > > Hm. I thought nfsd threads were a global resource -- they service
> > > > > > > > all network namespaces. So, shouldn't the same thread count be
> > > > > > > > surfaced to all containers? Won't they all see all of the nfsd
> > > > > > > > processes?
> > > > > > 
> > > > > > Each container is going to start /proc/fs/nfsd/threads number of threads
> > > > > > regardless. I hadn't actually grokked that they just get tossed onto the
> > > > > > pile of threads that service requests.
> > > > > > 
> > > > > > Is it possible for one container to start a small number of threads but
> > > > > > have its client load be such that it spills over and ends up stealing
> > > > > > threads from other containers?
> > > > > 
> > > > > I haven't seen any code that manages resources based on namespace,
> > > > > except in filecache.c to restrict writeback per namespace.
> > > > > 
> > > > > My impression is that any nfsd thread can serve any namespace. I'm
> > > > > not sure it is currently meaningful for a particular net namespace to
> > > > > "create" more threads.
> > > > > 
> > > > > If someone would like that level of control, we could implement a
> > > > > cgroup mechanism and have one or more separate svc_pools per net
> > > > > namespace, maybe? </hand wave>
> > > > 
> > > > AFAICT, the total number of threads on the system will be the sum of the
> > > > threads started in each of the containers. They do just go into a big
> > > > pile, and whichever one wakes up will service the request, so the
> > > > threads aren't associated with the netns, per se. The svc_rqst's however
> > > > _are_ per-netns. So, I don't see anything that ensures that a container
> > > > doesn't exceed the number of threads it started on its own behalf.
> > > > 
> > > > <hand wave>
> > > > I'm not sure we'd need to tie this in to cgroups.
> > > 
> > > Not a need, but cgroups are typically the way that such
> > > resource constraints are managed, that's all.
> > > 
> > > > Now that Josef is
> > > > moving some of these key structures to be per-net, it should be fairly
> > > > simple to have nfsd() just look at the th_cnt and the thread count in
> > > > the current namespace, and just enqueue the RPC rather than doing it?
> > > > </hand wave>
> > > > 
> > > > OTOH, maybe I'm overly concerned here.
> > > > 
> > > > > > > I don't think we want the network namespaces seeing how many threads exist in
> > > > > > > the entire system right?
> > > > > 
> > > > > If someone in a non-init net namespace does a "pgrep -c nfsd" don't
> > > > > they see the total nfsd thread count for the host?
> > > > 
> > > > Yes, they're kernel threads and they aren't associated with a particular
> > > > pid namespace.
> > > > 
> > > > > > > Additionally it appears that we can have multiple threads per network namespace,
> > > > > > > so it's not like this will just show 1 for each individual nn, it'll show
> > > > > > > however many threads have been configured for that nfsd in that network
> > > > > > > namespace.
> > > > > 
> > > > > I've never tried this, so I'm speculating. But it seems like for
> > > > > now, because all nfsd threads can serve all namespaces, they should
> > > > > all see the global thread count stat.
> > > > > 
> > > > > Then later we can refine it.
> > > > 
> > > > I don't think that info is particularly useful though, and it certainly
> > > > breaks expectations wrt container isolation.
> > > > 
> > > > Look at it this way:
> > > > 
> > > > Suppose I have access to a container and I spin up nfsd with a
> > > > particular number of threads. I now want to know "did I spin up enough
> > > > threads?"
> > > 
> > > It makes sense to me for a container to ask for one or more
> > > NFSD listeners. But I'm not yet clear on what it means for
> > > a container to try to alter the NFSD thread count, which is
> > > currently global.
> > 
> > write_threads sets the number of nfsd threads for the svc_serv, which is
> > a per-net object, so serv->sv_nrthreads is only counting the threads for
> > its namespace.
> 
> OK. I missed the part where struct svc_serv is per-net, and
> each svc_serv has its own thread pool. Got it.
> 
> > Each container that starts an nfsd will contribute "threads" number of
> > nfsds to the pile. When you set "threads" to a different number, the
> > total pile is grown or reduced accordingly. In fact, I'm not even sure
> > we keep a systemwide total.
> > 
> > What _isn't_ done (unless I'm missing something) is any sort of
> > limitation on the use of those threads. So you could have a container
> > that starts one nfsd thread, but its clients can still have many more
> > RPCs in flight, up to the total number of nfsd's running on the host.
> > Limiting that might be possible, but I'm not yet convinced that it's a
> > good idea.
> > 
> > It might be interesting to gather some metrics though. How often does the
> > number of RPCs in flight exceed the number of threads that we started in
> > a namespace? Might also be good to gather a high-water mark too?
> 
> I had a bunch of trace points queued up for this purpose,
> and all that got thrown out by Neil's recent work fixing
> up our pool thread scheduler.
> 
> I'm not sure that queued requests are all that meaningful
> or even quantifiable. A more meaningful metric is round
> trip time measured on clients.
> 

I wouldn't try to queue anything just yet, but knowing when a particular
container is exceeding the number of threads it started seems like a
helpful metric. That could be done in a later patch though. I wouldn't
block this work on that.

> > > > By making this per-namespace as Josef suggests, it should be
> > > > fairly simple to tell whether my clients are regularly overrunning the
> > > > threads I started. With this info as global, I have no idea what netns
> > > > the RPCs being counted are against. I can't do anything with that info.
> > > 
> > > That's true. I'm just not sure we need to do anything
> > > significant about it as part of this patch series.
> > > 
> > > I'm not at all advocating leaving it like this.
> > > 
> > > For now, just fill in that thread count stat with the global
> > > thread count, and later when we have a better concept for
> > > what it means to "adjust the nfsd thread count from inside a
> > > container" we can come back and make it make sense.
> > 
> > It would be good to step back and decide how we think this should work.
> > What would this look like if we were designing it today? Then we can
> > decide how to get there.
> > 
> > Personally, I think keeping the view inside the container as isolated as
> > possible is the right thing to do.
> 
> I agree. I would like to get Josef's patches in quickly too.
> 
> Should that metric report svc_serv::sv_nrthreads instead of
> the global thread count?
> 

No. sv_nrthreads is just the same value that /proc/fs/nfsd/threads
shows. The _metric_ should show the th_cnt for that particular
namespace. It would also be nice to know what the high-water mark for
th_cnt is. IOW, if my container's clients exceeded the number of
threads I've started, then it would be helpful to know by how much.
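To be concrete, here's a rough sketch of the bookkeeping I have in mind.
It's a standalone userspace toy, completely untested, and the names
(nn_stats, th_cnt, th_max, nn_rpc_start/end) are made up for
illustration rather than taken from Josef's patches: bump a
per-namespace busy count when a thread picks up an RPC for that netns,
drop it when the RPC completes, and keep a high-water mark next to it.

/*
 * Userspace model of per-namespace thread accounting -- not nfsd code.
 */
#include <stdatomic.h>
#include <stdio.h>

struct nn_stats {
	atomic_int th_cnt;	/* threads busy for this namespace right now */
	atomic_int th_max;	/* high-water mark of th_cnt */
};

static void nn_rpc_start(struct nn_stats *nn)
{
	int cur = atomic_fetch_add(&nn->th_cnt, 1) + 1;
	int max = atomic_load(&nn->th_max);

	/* Retry until th_max reflects the largest th_cnt we've seen. */
	while (cur > max &&
	       !atomic_compare_exchange_weak(&nn->th_max, &max, cur))
		;
}

static void nn_rpc_end(struct nn_stats *nn)
{
	atomic_fetch_sub(&nn->th_cnt, 1);
}

int main(void)
{
	struct nn_stats nn = { 0, 0 };

	/* Pretend three RPCs arrive before any of them complete. */
	nn_rpc_start(&nn);
	nn_rpc_start(&nn);
	nn_rpc_start(&nn);
	nn_rpc_end(&nn);

	printf("busy now: %d, high-water mark: %d\n",
	       atomic_load(&nn.th_cnt), atomic_load(&nn.th_max));
	return 0;
}

In the kernel the counters would live in struct nfsd_net and would
probably want to be percpu or at least atomic, but the idea is the
same: compare the high-water mark against the thread count the
container asked for and you know whether it started enough threads.

-- 
Jeff Layton <jlayton@xxxxxxxxxx>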