Re: RFE: Linux nfsd's |ca_maxoperations| should be at least |64| ... / was: Re: kernel.org list issues... / was: Fwd: Turn NFSD_MAX_* into tuneables ? / was: Re: Increasing NFSD_MAX_OPS_PER_COMPOUND to 96

Chuck Lever III <chuck.lever@xxxxxxxxxx> · Thu, 18 Jan 2024 14:52:55 +0000

> On Jan 18, 2024, at 4:44 AM, Martin Wege <martin.l.wege@xxxxxxxxx> wrote:
> 
> On Thu, Jan 18, 2024 at 2:57 AM Roland Mainz <roland.mainz@xxxxxxxxxxx> wrote:
>> 
>> On Sat, Jan 13, 2024 at 5:10 PM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>>>> On Jan 13, 2024, at 10:09 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>>> On Sat, 2024-01-13 at 15:47 +0100, Roland Mainz wrote:
>>>>> On Sat, Jan 13, 2024 at 1:19 AM Dan Shelton <dan.f.shelton@xxxxxxxxx> wrote:
>> [snip]
>>>>> Is this the windows client?
>>>> No, the ms-nfs41-client (see
>>>> https://github.com/kofemann/ms-nfs41-client) uses a limit of |16|, but
>>>> it is on our ToDo list to bump that to |128| (but honoring the limit
>>>> set by the NFSv4.1 server during session negotiation) since it now
>>>> supports very long paths ([1]) and this issue is a known performance
>>>> bottleneck.
>>> 
>>> A better way to optimize this case is to walk the path once
>>> and cache the terminal component's file handle. This is what
>>> Linux does, and it sounds like Dan's directory walker
>>> optimizations do effectively the same thing.
>> 
>> That assumes that no process does random access into deep subdirs. In
>> that case the performance is absolutely terrible, unless you devote
>> lots of memory to a giant cache (which is not feasible due to cache
>> expiration limits, unless someone (please!) finally implements
>> directory delegations).

Do you mean not feasible for your client? Lookup caches
have been part of operating systems for decades. Solaris,
FreeBSD, and Linux all have one. Does the Windows kernel
have one that ms-nfs41-client can use?

>> This also ignores the use case of WAN (wide-area networks) and WLAN
>> with the typical high latency and even higher amounts of network
>> packet loss and retransmits, where the splitting of the requests comes
>> with a HUGE latency penalty (you can reproduce this with network
>> tools, just export a large tmpfs on the server, add a packet delay of
>> 400ms between client and server, use a path like
>> "a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x/y/z/0/1/2/3/4/5/6/7/8/9",
>> and compile gcc).

The most frequently implemented solution to this problem
is a lookup cache. Operating systems use it for local
on-disk filesystems as well as for NFS.

In the local filesystem case:

Think about how long each path resolution would take if
the operating system had to consult on-disk information
for every component in the pathname.

In the NFS case:

The fastest round trip is no round trip. Keep a local
cache and path resolution will be fast no matter what
the network latency is.

Note that the NFS server is going to use a lookup cache
to make large path resolution COMPOUNDs go fast. It
would be even faster (from the application's point of
view) if that cache were local to the client.
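
To illustrate the point about round trips, here is a minimal sketch
of a client-side lookup cache. It is not how any real client is
implemented: the class name, the fake_server_lookup() helper, and the
string "file handles" are all made up for illustration, and cache
expiry and invalidation are deliberately left out.

```python
# Hypothetical sketch: a lookup cache keyed by (parent handle, name).
# Each cache miss stands in for one LOOKUP round trip to the server.

class LookupCache:
    def __init__(self, lookup_rpc):
        self._lookup_rpc = lookup_rpc   # callable(parent_fh, name) -> child_fh
        self._cache = {}                # (parent_fh, name) -> child_fh
        self.round_trips = 0            # number of simulated server calls

    def resolve(self, root_fh, path):
        """Walk 'path' one component at a time, using cached results if present."""
        fh = root_fh
        for name in path.split("/"):
            if not name:
                continue
            key = (fh, name)
            if key not in self._cache:
                self._cache[key] = self._lookup_rpc(fh, name)
                self.round_trips += 1
            fh = self._cache[key]
        return fh


def fake_server_lookup(parent_fh, name):
    # Stand-in for a real LOOKUP; the "file handle" is just a string.
    return f"{parent_fh}/{name}"


cache = LookupCache(fake_server_lookup)
first = cache.resolve("root", "a/b/c/d")
trips_cold = cache.round_trips               # one per component on a cold cache
second = cache.resolve("root", "a/b/c/d")
trips_warm = cache.round_trips - trips_cold  # nothing goes to the server
```

On a cold cache the walk costs one simulated round trip per
component. The second walk of the same path costs none, regardless
of how much latency the network adds to each trip.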

Sending a full path in a single COMPOUND is one way to
handle path resolution, but it has so many limitations
that it's really not the mechanism of choice.

>> And in the real world the Linux nfsd |ca_maxoperations| default of
>> |16| is absolutely CRIPPLING.
>> For example in the ms-nfs41-client we need 4 operations for initial
>> setup for a file lookup, and then 3 per path component. That means
>> that a default of 16 just fits (16-4)/3=4 path elements.
>> Unfortunately the statistical average is not 4 - it's 11 (measured
>> over five weeks with 81 clients in our company).
>> Technically, in this scenario, a default of at least 11*3+4=37 would
>> be MUCH better.
>> 
>> That's why I think nfsd's |ca_maxoperations| should be at *least* |64|.
> 
> +1
> 
> I consider the default value of 16 even a bug, given the circumstances.

This is not an NFSD bug. Read to the bottom to see where
the real problem is.

Here are the CREATE_SESSION arguments from a Linux client:

                csa_fore_chan_attrs
                    hdr pad size: 0
                    max req size: 1049620
                    max resp size: 1049480
                    max resp size cached: 7584
                    max ops: 8
                    max reqs: 64
                csa_back_chan_attrs
                    hdr pad size: 0
                    max req size: 4096
                    max resp size: 4096
                    max resp size cached: 0
                    max ops: 2
                    max reqs: 16

The ca_maxoperations field contains 8.

The response from NFSD looks like this:

                csr_fore_chan_attrs
                    hdr pad size: 0
                    max req size: 1049620
                    max resp size: 1049480
                    max resp size cached: 2128
                    max ops: 8
                    max reqs: 30
                csr_back_chan_attrs
                    hdr pad size: 0
                    max req size: 4096
                    max resp size: 4096
                    max resp size cached: 0
                    max ops: 2
                    max reqs: 16

The ca_maxoperations field again contains 8.

Here's what RFC 8881 Section 18.36.3 says:

> ca_maxoperations:
>     The maximum number of operations the replier will accept
>     in a COMPOUND or CB_COMPOUND. For the backchannel, the
>     server MUST NOT change the value the client offers. For
>     the fore channel, the server MAY change the requested
>     value. After the session is created, if a requester sends
>     a COMPOUND or CB_COMPOUND with more operations than
>     ca_maxoperations, the replier MUST return
>     NFS4ERR_TOO_MANY_OPS.

The BCP 14 "MAY" here means that servers can return the same
value, but clients have to expect that a server might return
something different.

Further, the spec does not permit an NFS server to respond to
a COMPOUND with more than the client's ca_maxoperations in
any way other than to return NFS4ERR_TOO_MANY_OPS. So it
cannot return a larger ca_maxoperations than the client sent.
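
As a rough sketch of the rule in the RFC text quoted above, the
replier-side check amounts to a single comparison. The function name
and the string status values below are placeholders for illustration
only, not NFSD code.

```python
# Hypothetical sketch of the replier-side check described in RFC 8881:
# a COMPOUND with more operations than the session's ca_maxoperations
# is rejected with NFS4ERR_TOO_MANY_OPS.

def check_compound(num_ops, session_max_ops):
    """Return the status a replier would give for a COMPOUND of num_ops operations."""
    if num_ops > session_max_ops:
        return "NFS4ERR_TOO_MANY_OPS"
    return "NFS4_OK"


at_limit = check_compound(16, 16)     # exactly at the limit is accepted
over_limit = check_compound(17, 16)   # one over the limit is rejected
```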

NFSD returns the minimum of the client's max-ops and its own
NFSD_MAX_OPS_PER_COMPOUND value, which is 50. Thus NFSD will
return the same value as the client, unless the client asks
for more than 50.
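
In other words, the fore channel negotiation behaves like the sketch
below. This is a simplified model, not the kernel source; the
function name is made up, and the constant just mirrors the value of
50 mentioned above.

```python
# Simplified model of the fore channel max-ops negotiation described
# above: the server replies with the smaller of the client's request
# and its own compile-time limit.

NFSD_MAX_OPS_PER_COMPOUND = 50  # value stated in this mail


def negotiate_fore_channel_max_ops(client_max_ops,
                                   server_limit=NFSD_MAX_OPS_PER_COMPOUND):
    """Return the ca_maxoperations value the server sends in its reply."""
    return min(client_max_ops, server_limit)


linux_client = negotiate_fore_channel_max_ops(8)    # the trace shown earlier
small_request = negotiate_fore_channel_max_ops(16)  # a client asking for 16
large_request = negotiate_fore_channel_max_ops(64)  # capped by the server
```

A client asking for 8 or 16 gets exactly what it asked for. Only a
request above 50 is reduced.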

So, the only reason NFSD returns 16 to your client is that
your client sets a value of 16 in its CREATE_SESSION Call. If
your client sent a larger value (like, 11*3+4), then NFSD would
respect that limit instead.
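
For reference, the arithmetic from the quoted mail can be written
out as follows. The costs of 4 operations for setup and 3 per path
component are Roland's figures for his client, not something NFSD
defines, and the helper names are made up for this sketch.

```python
# Operation budget per COMPOUND, using the per-client costs quoted
# earlier in this thread (assumptions taken from that mail).

SETUP_OPS = 4          # operations needed before the first path component
OPS_PER_COMPONENT = 3  # operations needed for each path component


def components_that_fit(max_ops):
    """Number of path components that fit in one COMPOUND of max_ops operations."""
    return max(0, (max_ops - SETUP_OPS) // OPS_PER_COMPONENT)


def ops_needed(num_components):
    """Operations needed to resolve num_components in a single COMPOUND."""
    return SETUP_OPS + OPS_PER_COMPONENT * num_components


fit_with_16 = components_that_fit(16)   # 4 components, as stated above
needed_for_11 = ops_needed(11)          # 37 operations for the measured average
fit_with_50 = components_that_fit(50)   # what the current server limit allows
```

With those costs, a request for 37 operations is below the server's
limit of 50 and would be granted as-is.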

The spec is very clear about how this needs to work, and
NFSD is 100% compliant with the spec here. It's the client that
has to request a larger limit.

--
Chuck Lever
