> On Jan 13, 2024, at 10:09 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> 
> On Sat, 2024-01-13 at 15:47 +0100, Roland Mainz wrote:
>> 
>> On Sat, Jan 13, 2024 at 1:19 AM Dan Shelton <dan.f.shelton@xxxxxxxxx> wrote:
>>> We've been experiencing significant nfsd performance problems with a
>>> customer who has a deeply nested filesystem hierarchy, lots of
>>> subdirs, some of them 60-80 dirs deep (!!), which leads to an
>>> exponential slowdown with nfsd accesses.
>>> 
>>> Some of the issues have been addressed by implementing a better
>>> directory walker via multiple dir fds and openat() (instead of just
>>> cwd+open()), but the nfsd side was still a pretty dramatic issue,
>>> until we bumped #define NFSD_MAX_OPS_PER_COMPOUND in
>>> linux-6.7/fs/nfsd/nfsd.h from 50 to 96. After that, the nfsd side
>>> performed MUCH better.
>> 
>> More general question:
>> Is it feasible to turn the values for NFSD_MAX_* (max_ops,
>> max_req, etc., i.e. everything that is negotiated in an NFSv4.1
>> session) into tunables, which are set at nfsd startup? It might help
>> with Dan's scenario, benchmarking, client testing (e.g. my case, where
>> I switched to nfs4j), and tuning...
>> 
> 
> (re-cc'ing the mailing list...)
> 
> We generally don't like to add knobs like this when we can get by with
> just tuning a sane value for everyone. This particular value governs the
> maximum number of operations per compound. I don't see any value in
> keeping it artificially low.
> 
> The only real argument against it that I can see is that it might make
> it easier for a malicious or badly-designed client to DoS the server.
> That's certainly something we should be wary of, but I don't expect that
> increasing the max from 50 to ~100 will make a big difference there.

The server allocates memory and other resources based on the largest
COMPOUND it expects. If we crank up the maximum number of operations,
that has an impact on server resource utilization.
In particular, those extra COMPOUND slots will almost never be used
except in a handful of corner cases. Plus, this becomes a race against
applications and workloads that try to consume past that limit: we bump
it, they use more and hit the new limit; we bump it again; lather,
rinse, repeat. Indeed, if we increase that value enough, it does become
a server DoS vector by tying up all available nfsd threads trying to
execute enormous COMPOUNDs.

The upshot is that I'm not in favor of increasing the max-ops limit or
making it tunable, unless we have grossly misunderstood the issue.

>> Solaris 11 is known to send COMPOUNDs that are too large
>> during mount, but the rest of the time these three client
>> implementations are not known to send large COMPOUNDs.
> 
> Actually the FreeBSD client is the same as Solaris, in that it does the
> entire mount path in one compound. If you were to attempt a mount
> with more than 48 components, it would exceed 50 ops in the compound.
> I don't think it can exceed 50 ops any other way.

I'd like to see the raw packet captures to confirm that our speculation
about the problem is indeed correct. Since this limit is hit only when
mounting (and not at all by Linux clients), I don't yet see how it
would "make NFSD slow".

>> I guess your clients are trying to do a long pathwalk in a single
>> COMPOUND?
> 
> Is there a problem with that (assuming NFSv4.1 session limits are
> honored)?

Yes: the client will very clearly hit a rather artificial path length
limit. And the limit isn't based on the character length of the path:
the limit is hit much sooner with a path that is constructed from a
series of very short component names, for instance.

Good client implementations keep the number of operations per COMPOUND
limited to a small number, and break up operations like path walks to
ensure that the protocol and server implementation do not impose any
kind of application-visible constraint.

>> Is this the windows client?
> 
> No, the ms-nfs41-client (see
> https://github.com/kofemann/ms-nfs41-client) uses a limit of |16|, but
> it is on our ToDo list to bump that to |128| (while honoring the limit
> set by the NFSv4.1 server during session negotiation), since it now
> supports very long paths ([1]) and this issue is a known performance
> bottleneck.

A better way to optimize this case is to walk the path once and cache
the terminal component's file handle. This is what Linux does, and it
sounds like Dan's directory walker optimizations do effectively the
same thing.

-- 
Chuck Lever