On Sat, 2024-01-13 at 16:10 +0000, Chuck Lever III wrote:
> 
> > On Jan 13, 2024, at 10:09 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > 
> > On Sat, 2024-01-13 at 15:47 +0100, Roland Mainz wrote:
> > > On Sat, Jan 13, 2024 at 1:19 AM Dan Shelton <dan.f.shelton@xxxxxxxxx> wrote:
> > > > We've been experiencing significant nfsd performance problems with a
> > > > customer who has a deeply nested filesystem hierarchy, lots of
> > > > subdirs, some of them 60-80 dirs deep (!!), which leads to an
> > > > exponential slowdown with nfsd accesses.
> > > > 
> > > > Some of the issues have been addressed by implementing a better
> > > > directory walker via multiple dir fds and openat() (instead of just
> > > > cwd+open()), but the nfsd side was still a pretty dramatic issue,
> > > > until we bumped #define NFSD_MAX_OPS_PER_COMPOUND in
> > > > linux-6.7/fs/nfsd/nfsd.h from 50 to 96. After that the nfsd side
> > > > was MUCH more performant.
> > > 
> > > More general question:
> > > Is it feasible to turn the values for NFSD_MAX_* (max_ops,
> > > max_req, etc., i.e. everything which is negotiated in an NFSv4.1
> > > session) into tunables, which are set at nfsd startup? It might help
> > > with Dan's scenario, benchmarking, client testing (e.g. my case, where
> > > I switched to nfs4j) and tuning...
> > 
> > (re-cc'ing the mailing list...)
> > 
> > We generally don't like to add knobs like this when we can get by with
> > just tuning a sane value for everyone. This particular value governs the
> > maximum number of operations per compound. I don't see any value in
> > keeping it artificially low.
> > 
> > The only real argument against it that I can see is that it might make
> > it easier for a malicious or badly-designed client to DoS the server.
> > That's certainly something we should be wary of, but I don't expect that
> > increasing the max from 50 to ~100 will make a big difference there.
> 
> The server allocates memory and other resources based on the
> largest COMPOUND it expects.
> 
> If we crank up the maximum number, it has an impact on server
> resource utilization. In particular, those extra COMPOUND
> slots will almost never be used except in a handful of corner
> cases.
> 
> Plus, this becomes a race against applications and workloads
> that try to consume past that limit. We bump it, they use
> more and hit the new limit. We bump it, lather, rinse,
> repeat.
> 
> Indeed, if we increase that value enough, it does become a
> server DoS vector by tying up all available nfsd threads
> trying to execute enormous COMPOUNDs.
> 
> The upshot is that I'm not in favor of increasing the max-ops
> limit or making it tunable, unless we have grossly misunderstood
> the issue.

Does it? The only thing I can see that scales directly with that value
is the size of struct nfsd_genl_rqstp. That's just part of the new
netlink stats interface, so I don't see that as a show-stopper.

Am I missing something else that scales directly with
NFSD_MAX_OPS_PER_COMPOUND?

> > > Solaris 11 is known to send COMPOUNDs that are too large
> > > during mount, but the rest of the time these three client
> > > implementations are not known to send large COMPOUNDs.
> > 
> > Actually the FreeBSD client is the same as Solaris, in that it does the
> > entire mount path in one compound. If you were to attempt a mount
> > with more than 48 components, it would exceed 50 ops in the compound.
> > I don't think it can exceed 50 ops any other way.
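
(To spell out the arithmetic for anyone following along: a whole-path
mount COMPOUND looks roughly like the sketch below. This is just my
illustration of the op sequence, not a capture of what FreeBSD or
Solaris actually sends, and clients may tack a GETATTR or similar on
the end as well.)

	PUTROOTFH
	LOOKUP "a"		(one LOOKUP per path component)
	LOOKUP "b"
	...
	LOOKUP "component48"
	GETFH

That's 1 + 48 + 1 = 50 ops for a 48-component path, so one more
component pushes it past the default limit.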
> 
> I'd like to see the raw packet captures to confirm that our
> speculation about the problem is indeed correct. Since this
> limit is hit only when mounting (and not at all by Linux
> clients), I don't yet see how that would "make NFSD slow".

It seems quite plausible that keeping the max low forces the client to
do a deep pathwalk using multiple RPCs instead of one. That seems like
it could have performance implications.

> > > I guess your clients are trying to do a long pathwalk in a single
> > > COMPOUND?
> > 
> > Is there a problem with that (assuming NFSv4.1 session limits are
> > honored)?
> 
> Yes: very clearly the client will hit a rather artificial
> path length limit. And the limit isn't based on the character
> length of the path: the limit is hit much sooner with a path
> that is constructed from a series of very short component
> names, for instance.
> 
> Good client implementations keep the number of operations per
> COMPOUND limited to a small number, and break up operations
> like path walks to ensure that the protocol and server
> implementation do not impose any kind of application-visible
> constraint.

Sure, and good servers try their best to deal with whatever the clients
throw at them. I don't really see the value in limiting the number of
ops per compound. Are we really any better off having the client break
those up into multiple round trips? Why?
-- 
Jeff Layton <jlayton@xxxxxxxxxx>
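
[Editor's note: to make the round-trip cost mentioned above concrete,
here is a toy userspace sketch, not code from any real client. The
assumption that each COMPOUND spends two ops on PUTFH/GETFH-style
bookkeeping, leaving max_ops - 2 LOOKUPs per round trip, is mine.]

	/*
	 * Toy model: how many round trips a deep pathwalk needs when
	 * the server caps each COMPOUND at max_ops. Assumes (my
	 * assumption) 2 ops of bookkeeping per COMPOUND, so each
	 * round trip carries at most max_ops - 2 LOOKUPs.
	 */
	#include <stdio.h>

	static unsigned int pathwalk_round_trips(unsigned int components,
						 unsigned int max_ops)
	{
		unsigned int lookups_per_rpc = max_ops - 2;

		/* ceiling division: one COMPOUND per chunk of components */
		return (components + lookups_per_rpc - 1) / lookups_per_rpc;
	}

	int main(void)
	{
		/* Dan's 80-deep tree, before and after the bump from 50 to 96 */
		printf("max_ops=50: %u round trips\n",
		       pathwalk_round_trips(80, 50));
		printf("max_ops=96: %u round trips\n",
		       pathwalk_round_trips(80, 96));
		return 0;
	}

Under those assumptions an 80-deep walk takes two RPCs at max_ops=50
and one at max_ops=96; repeated across a tree walker visiting many
such directories, that difference could add up.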