> On Jan 13, 2024, at 10:09 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> 
> On Sat, 2024-01-13 at 15:47 +0100, Roland Mainz wrote:
>> 
>> On Sat, Jan 13, 2024 at 1:19 AM Dan Shelton <dan.f.shelton@xxxxxxxxx> wrote:
>>> We've been experiencing significant nfsd performance problems with a
>>> customer who has a deeply nested filesystem hierarchy, lots of
>>> subdirs, some of them 60-80 dirs deep (!!), which leads to an
>>> exponential slowdown with nfsd accesses.
>>> 
>>> Some of the issues have been addressed by implementing a better
>>> directory walker via multiple dir fds and openat() (instead of just
>>> cwd+open()), but the nfsd side was still a pretty dramatic issue,
>>> until we bumped #define NFSD_MAX_OPS_PER_COMPOUND in
>>> linux-6.7/fs/nfsd/nfsd.h from 50 to 96. After that, the nfsd side
>>> performed MUCH better.
>> 
>> More general question:
>> Is it feasible to turn the values for NFSD_MAX_* (max_ops,
>> max_req, etc., i.e. everything that is negotiated in an NFSv4.1
>> session) into tunables, which are set at nfsd startup? It might help
>> with Dan's scenario, benchmarking, client testing (e.g. my case, where
>> I switched to nfs4j), and tuning...
>> 
> 
> (re-cc'ing the mailing list...)
> 
> We generally don't like to add knobs like this when we can get by with
> just tuning a sane value for everyone. This particular value governs the
> maximum number of operations per compound. I don't see any value in
> keeping it artificially low.
> 
> The only real argument against it that I can see is that it might make
> it easier for a malicious or badly-designed client to DoS the server.
> That's certainly something we should be wary of, but I don't expect that
> increasing the max from 50 to ~100 will make a big difference there.

The server allocates memory and other resources based on the largest
COMPOUND it expects. If we crank up the maximum number of operations,
that has an impact on server resource utilization.
In particular, those extra COMPOUND slots will almost never be used
except in a handful of corner cases. Plus, this becomes a race against
applications and workloads that try to consume past that limit: we bump
it, they use more and hit the new limit; we bump it again; lather,
rinse, repeat. Indeed, if we increase that value enough, it does become
a server DoS vector by tying up all available nfsd threads trying to
execute enormous COMPOUNDs.

The upshot is that I'm not in favor of increasing the max-ops limit or
making it tunable, unless we have grossly misunderstood the issue.

>> Solaris 11 is known to send COMPOUNDs that are too large
>> during mount, but the rest of the time these three client
>> implementations are not known to send large COMPOUNDs.
> 
> Actually the FreeBSD client is the same as Solaris, in that it does the
> entire mount path in one compound. If you were to attempt a mount
> with more than 48 components, it would exceed 50 ops in the compound.
> I don't think it can exceed 50 ops any other way.

I'd like to see the raw packet captures to confirm that our speculation
about the problem is indeed correct. Since this limit is hit only when
mounting (and not at all by Linux clients), I don't yet see how it
would "make NFSD slow".

>> I guess your clients are trying to do a long pathwalk in a single
>> COMPOUND?
> 
> Is there a problem with that (assuming NFSv4.1 session limits are
> honored)?

Yes: the client will very clearly hit a rather artificial path length
limit. And the limit isn't based on the character length of the path:
the limit is hit much sooner with a path that is constructed from a
series of very short component names, for instance.

Good client implementations keep the number of operations per COMPOUND
limited to a small number, and break up operations like path walks to
ensure that the protocol and server implementation do not impose any
kind of application-visible constraint.

>> Is this the windows client?
> 
> No, the ms-nfs41-client (see
> https://github.com/kofemann/ms-nfs41-client) uses a limit of |16|, but
> it is on our ToDo list to bump that to |128| (while honoring the limit
> set by the NFSv4.1 server during session negotiation), since it now
> supports very long paths ([1]) and this issue is a known performance
> bottleneck.

A better way to optimize this case is to walk the path once and cache
the terminal component's file handle. This is what Linux does, and it
sounds like Dan's directory walker optimizations do effectively the
same thing.

-- 
Chuck Lever