On Fri, 2024-01-26 at 15:16 +0000, Chuck Lever III wrote:
> 
> > On Jan 26, 2024, at 10:03 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > 
> > On Fri, 2024-01-26 at 14:27 +0000, Chuck Lever III wrote:
> > > 
> > > > On Jan 26, 2024, at 9:08 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > 
> > > > On Fri, 2024-01-26 at 13:48 +0000, Chuck Lever III wrote:
> > > > > 
> > > > > > On Jan 26, 2024, at 8:01 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > > > 
> > > > > > On Thu, 2024-01-25 at 16:56 -0500, Josef Bacik wrote:
> > > > > > > On Thu, Jan 25, 2024 at 04:01:27PM -0500, Chuck Lever wrote:
> > > > > > > > On Thu, Jan 25, 2024 at 02:53:20PM -0500, Josef Bacik wrote:
> > > > > > > > > This is the last global stat, move it into nfsd_net and adjust all the
> > > > > > > > > users to use that variant instead of the global one.
> > > > > > > > 
> > > > > > > > Hm. I thought nfsd threads were a global resource -- they service
> > > > > > > > all network namespaces. So, shouldn't the same thread count be
> > > > > > > > surfaced to all containers? Won't they all see all of the nfsd
> > > > > > > > processes?
> > > > > > 
> > > > > > Each container is going to start /proc/fs/nfsd/threads number of threads
> > > > > > regardless. I hadn't actually grokked that they just get tossed onto the
> > > > > > pile of threads that service requests.
> > > > > > 
> > > > > > Is it possible for one container to start a small number of threads but
> > > > > > have its client load be such that it spills over and ends up stealing
> > > > > > threads from other containers?
> > > > > 
> > > > > I haven't seen any code that manages resources based on namespace,
> > > > > except in filecache.c to restrict writeback per namespace.
> > > > > 
> > > > > My impression is that any nfsd thread can serve any namespace. I'm
> > > > > not sure it is currently meaningful for a particular net namespace to
> > > > > "create" more threads.
> > > > > 
> > > > > If someone would like that level of control, we could implement a
> > > > > cgroup mechanism and have one or more separate svc_pools per net
> > > > > namespace, maybe? </hand wave>
> > > > 
> > > > AFAICT, the total number of threads on the system will be the sum of the
> > > > threads started in each of the containers. They do just go into a big
> > > > pile, and whichever one wakes up will service the request, so the
> > > > threads aren't associated with the netns, per se. The svc_rqst's however
> > > > _are_ per-netns. So, I don't see anything that ensures that a container
> > > > doesn't exceed the number of threads it started on its own behalf.
> > > > 
> > > > <hand wave>
> > > > I'm not sure we'd need to tie this in to cgroups.
> > > 
> > > Not a need, but cgroups are typically the way that such
> > > resource constraints are managed, that's all.
> > > 
> > > > Now that Josef is
> > > > moving some of these key structures to be per-net, it should be fairly
> > > > simple to have nfsd() just look at the th_cnt and the thread count in
> > > > the current namespace, and just enqueue the RPC rather than doing it?
> > > > </hand wave>
> > > > 
> > > > OTOH, maybe I'm overly concerned here.
> > > > 
> > > > > > > I don't think we want the network namespaces seeing how many threads exist in
> > > > > > > the entire system right?
> > > > > 
> > > > > If someone in a non-init net namespace does a "pgrep -c nfsd" don't
> > > > > they see the total nfsd thread count for the host?
> > > > 
> > > > Yes, they're kernel threads and they aren't associated with a particular
> > > > pid namespace.
> > > > 
> > > > > > > Additionally it appears that we can have multiple threads per network namespace,
> > > > > > > so it's not like this will just show 1 for each individual nn, it'll show
> > > > > > > however many threads have been configured for that nfsd in that network
> > > > > > > namespace.
> > > > > 
> > > > > I've never tried this, so I'm speculating. But it seems like for
> > > > > now, because all nfsd threads can serve all namespaces, they should
> > > > > all see the global thread count stat.
> > > > > 
> > > > > Then later we can refine it.
> > > > 
> > > > I don't think that info is particularly useful though, and it certainly
> > > > breaks expectations wrt container isolation.
> > > > 
> > > > Look at it this way:
> > > > 
> > > > Suppose I have access to a container and I spin up nfsd with a
> > > > particular number of threads. I now want to know "did I spin up enough
> > > > threads?"
> > > 
> > > It makes sense to me for a container to ask for one or more
> > > NFSD listeners. But I'm not yet clear on what it means for
> > > a container to try to alter the NFSD thread count, which is
> > > currently global.
> > 
> > write_threads sets the number of nfsd threads for the svc_serv, which is
> > a per-net object, so serv->sv_nrthreads is only counting the threads for
> > its namespace.
> 
> OK. I missed the part where struct svc_serv is per-net, and
> each svc_serv has its own thread pool. Got it.
> 
> > Each container that starts an nfsd will contribute "threads" number of
> > nfsds to the pile. When you set "threads" to a different number, the
> > total pile is grown or reduced accordingly. In fact, I'm not even sure
> > we keep a systemwide total.
> > 
> > What _isn't_ done (unless I'm missing something) is any sort of
> > limitation on the use of those threads. So you could have a container
> > that starts one nfsd thread, but its clients can still have many more
> > RPCs in flight, up to the total number of nfsd's running on the host.
> > Limiting that might be possible, but I'm not yet convinced that it's a
> > good idea.
> > 
> > It might be interesting to gather some metrics though. How often does the
> > number of RPCs in flight exceed the number of threads that we started in
> > a namespace? Might also be good to gather a high-water mark too?
> 
> I had a bunch of trace points queued up for this purpose,
> and all that got thrown out by Neil's recent work fixing
> up our pool thread scheduler.
> 
> I'm not sure that queued requests are all that meaningful
> or even quantifiable. A more meaningful metric is round
> trip time measured on clients.
> 

I wouldn't try to queue anything just yet, but knowing when a particular
container is exceeding the number of threads it started seems like a
helpful metric. That could be done in a later patch though. I wouldn't
block this work on that.

> > > > By making this per-namespace as Josef suggests, it should be
> > > > fairly simple to tell whether my clients are regularly overrunning the
> > > > threads I started. With this info as global, I have no idea what netns
> > > > the RPCs being counted are against. I can't do anything with that info.
> > > 
> > > That's true. I'm just not sure we need to do anything
> > > significant about it as part of this patch series.
> > > 
> > > I'm not at all advocating leaving it like this.
> > > 
> > > For now, just fill in that thread count stat with the global
> > > thread count, and later when we have a better concept for
> > > what it means to "adjust the nfsd thread count from inside a
> > > container" we can come back and make it make sense.
> > 
> > It would be good to step back and decide how we think this should work.
> > What would this look like if we were designing it today? Then we can
> > decide how to get there.
> > 
> > Personally, I think keeping the view inside the container as isolated as
> > possible is the right thing to do.
> 
> I agree. I would like to get Josef's patches in quickly too.
> 
> Should that metric report svc_serv::sv_nrthreads instead of
> the global thread count?
> 

No. sv_nrthreads is just the same value that /proc/fs/nfsd/threads
shows. The _metric_ should show the th_cnt for that particular
namespace. It would also be nice to know what the high-water mark for
th_cnt is. IOW, if my container's clients exceeded the number of
threads I've started, then it would be helpful to know by how much.
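To be concrete, here's a rough sketch of the bookkeeping I have in mind.
It's a standalone userspace toy, completely untested, and the names
(nn_stats, th_cnt, th_max, nn_rpc_start/end) are made up for
illustration rather than taken from Josef's patches: bump a
per-namespace busy count when a thread picks up an RPC for that netns,
drop it when the RPC completes, and keep a high-water mark next to it.

/*
 * Userspace model of per-namespace thread accounting -- not nfsd code.
 */
#include <stdatomic.h>
#include <stdio.h>

struct nn_stats {
	atomic_int th_cnt;	/* threads busy for this namespace right now */
	atomic_int th_max;	/* high-water mark of th_cnt */
};

static void nn_rpc_start(struct nn_stats *nn)
{
	int cur = atomic_fetch_add(&nn->th_cnt, 1) + 1;
	int max = atomic_load(&nn->th_max);

	/* Retry until th_max reflects the largest th_cnt we've seen. */
	while (cur > max &&
	       !atomic_compare_exchange_weak(&nn->th_max, &max, cur))
		;
}

static void nn_rpc_end(struct nn_stats *nn)
{
	atomic_fetch_sub(&nn->th_cnt, 1);
}

int main(void)
{
	struct nn_stats nn = { 0, 0 };

	/* Pretend three RPCs arrive before any of them complete. */
	nn_rpc_start(&nn);
	nn_rpc_start(&nn);
	nn_rpc_start(&nn);
	nn_rpc_end(&nn);

	printf("busy now: %d, high-water mark: %d\n",
	       atomic_load(&nn.th_cnt), atomic_load(&nn.th_max));
	return 0;
}

In the kernel the counters would live in struct nfsd_net and would
probably want to be percpu or at least atomic, but the idea is the
same: compare the high-water mark against the thread count the
container asked for and you know whether it started enough threads.

-- 
Jeff Layton <jlayton@xxxxxxxxxx>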