Re: |ca_maxoperations| - tuneable ? / was: Re: RFE: Linux nfsd's |ca_maxoperations| should be at least |64| ... / was: Re: kernel.org list issues... / was: Fwd: Turn NFSD_MAX_* into tuneables ? / was: Re: Increasing NFSD_MAX_OPS_PER_COMPOUND to 96

Chuck Lever III <chuck.lever@xxxxxxxxxx> · Sat, 16 Mar 2024 16:35:41 +0000

> On Mar 16, 2024, at 7:55 AM, Roland Mainz <roland.mainz@xxxxxxxxxxx> wrote:
> 
> On Thu, Jan 18, 2024 at 3:52 PM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>>> On Jan 18, 2024, at 4:44 AM, Martin Wege <martin.l.wege@xxxxxxxxx> wrote:
>>> On Thu, Jan 18, 2024 at 2:57 AM Roland Mainz <roland.mainz@xxxxxxxxxxx> wrote:
>>>> On Sat, Jan 13, 2024 at 5:10 PM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>>>>>> On Jan 13, 2024, at 10:09 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>>>>> On Sat, 2024-01-13 at 15:47 +0100, Roland Mainz wrote:
>>>>>>> On Sat, Jan 13, 2024 at 1:19 AM Dan Shelton <dan.f.shelton@xxxxxxxxx> wrote:
> [snip]
>>>> That assumes that no process does random access into deep subdirs. In
>>>> that case the performance is absolutely terrible, unless you devote
>>>> lots of memory to a giant cache (which is not feasible due to cache
>>>> expiration limits, unless someone (please!) finally implements
>>>> directory delegations).
>> 
>> Do you mean not feasible for your client? Lookup caches
>> have been part of operating systems for decades. Solaris,
>> FreeBSD, and Linux all have one. Does the Windows kernel
>> have one that ms-nfs41-client can use?
> 
> The ms-nfs41-client has its own cache.
> Technically Windows has another, but that is in the kernel and
> difficult to connect to the NFS client daemon without performance
> issues.
> 
> [snip]
>> Sending a full path in a single COMPOUND is one way to
>> handle path resolution, but it has so many limitations
>> that it's really not the mechanism of choice.
> 
> Which limitations ?

The most important limitation is the maximum size of
a forward channel RPC Call and Reply:

        count4 ca_maxrequestsize;
        count4 ca_maxresponsesize;

You can't put more COMPOUND operations in a single RPC
than will fit within these limits.
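
To make the size limit concrete, here is a rough sketch (Python, not
taken from any existing client) that estimates how many LOOKUP
operations fit under ca_maxrequestsize. It relies on the XDR rule that
a variable-length string is encoded as a 4-byte length followed by the
data padded to a multiple of 4 bytes. The fixed_overhead parameter is
an assumption: it stands in for the RPC header, credentials, COMPOUND
header, and any SEQUENCE/PUTFH/GETFH operations, whose sizes depend on
the client.

```python
def xdr_opaque_size(n: int) -> int:
    """XDR variable-length opaque: 4-byte length + data padded to 4 bytes."""
    return 4 + ((n + 3) // 4) * 4


def lookup_op_size(component: str) -> int:
    """Encoded size of one LOOKUP: 4-byte opcode + component name."""
    return 4 + xdr_opaque_size(len(component.encode("utf-8")))


def lookups_that_fit(components, ca_maxrequestsize, fixed_overhead):
    """Count how many leading path components fit in a single request.

    fixed_overhead is a hypothetical byte count for everything in the
    request that is not a LOOKUP operation.
    """
    used = fixed_overhead
    count = 0
    for name in components:
        size = lookup_op_size(name)
        if used + size > ca_maxrequestsize:
            break
        used += size
        count += 1
    return count
```

Long directory names eat into the budget quickly, so the number of
components that fit depends on the names, not only on the path depth.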

> The reason why I am looking to stuff more info into a request:
> - VPN has very high latency, so splitting requests hurts performance *BADLY*.

Sure, if your client serializes the requests as you
describe below, adding a network transit latency is
going to be a problem. I recommend that your client
not rely on the server and network to guarantee
request processing order. It should instead enforce
its own ordering requirements.

> I've been slapped about path/dir lookup performance now many times,
> and while there is more than one issue (Cygwin looks for "file" and
> "file.lnk"&co for each file + our readdir implementation needs lots of
> work) the biggest issue is that we split requests up because they usually
> do not fit.

High latency is something that is a well-understood
problem. You are better off caching lookup results
on your client to reduce the amount of slow
interaction a client has with the server. This is
the way every other NFS client works.

> - Windows API is async+multithreaded, which results in that requests
> do not always come in the logical/expected/useful order, which leads
> to cache issues.
> Seriously this issue is so bad that it is worth a research paper

Your client really should serialize itself and not
rely on the server for ordering. If the client has
serialization requirements, it needs to enforce those
itself. Any modern I/O system is going to be "fire
and forget" -- it will then wait and handle the
replies in whatever order they arrive. Your client
caches should do the same.
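
One way a client can enforce its own ordering is sketched below in
Python. This is only an illustration: OrderedApplier and its methods
are hypothetical names, and the sequence number is a client-local
counter, not anything carried in the NFS protocol. Requests are sent
without waiting, replies are accepted in whatever order they arrive,
and results are applied to the cache strictly in issue order.

```python
class OrderedApplier:
    """Apply replies in request-issue order, regardless of arrival order."""

    def __init__(self):
        self._next_seq = 0        # next sequence number to hand out
        self._next_to_apply = 0   # lowest sequence number not yet applied
        self._pending = {}        # seq -> result, for replies that arrived early
        self.applied = []         # results in the order they were applied

    def issue(self) -> int:
        """Tag a new outgoing request with a client-local sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        return seq

    def on_reply(self, seq: int, result) -> None:
        """Record a reply, then apply every result that is now in order."""
        self._pending[seq] = result
        while self._next_to_apply in self._pending:
            self.applied.append(self._pending.pop(self._next_to_apply))
            self._next_to_apply += 1
```

A reply that arrives early simply waits in _pending until its
predecessors show up; nothing depends on the server or the network
preserving order.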

> - Real-world paths on Windows are LONG with many subdirs, even worse
> when projects and organisations change, shift, reorganise, move,
> merge, split, get outsourced etc. over *DECADES*. Plus non-IT-users
> have zero awareness about "path limits", and sometimes dump whole
> sentences into directory names (e.g. "customer XYZ. can be ignored he
> terminated the business relationship on 26 May 2001. please do not
> delete dir" <----- xxx@@!!!! ).
> That issue haunts us in other ways too, e.g.  in the ms-nfs41-client
> project I had to extend the maximum supported path length multiple
> times to support this craziness, right now we support 4096 byte paths
> ([1]), with the longest known path being 1772, and others reported
> even more.

Again, your client really needs to handle this
scalably by resolving the path one component at
a time and caching the directory hierarchy
locally. It's not going to work by bumping up
these limits over time because you will always
hit some limit in the protocol.
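
A minimal sketch of that approach, in Python and under stated
assumptions: LookupCache is a hypothetical class, and lookup_rpc is a
placeholder callable standing in for whatever sends a single LOOKUP to
the server and returns the child file handle. The sketch has no expiry
or invalidation, which a real client would need.

```python
class LookupCache:
    """Resolve paths one component at a time, caching each directory entry."""

    def __init__(self, lookup_rpc):
        # lookup_rpc(parent_fh, name) -> child_fh; hypothetical stand-in
        # for one LOOKUP round trip to the server.
        self._lookup_rpc = lookup_rpc
        self._entries = {}   # (parent_fh, name) -> child_fh
        self.rpc_calls = 0   # number of round trips actually made

    def resolve(self, root_fh, path: str):
        """Walk the path, going to the server only for uncached components."""
        fh = root_fh
        for name in (p for p in path.split("/") if p):
            key = (fh, name)
            if key not in self._entries:
                self._entries[key] = self._lookup_rpc(fh, name)
                self.rpc_calls += 1
            fh = self._entries[key]
        return fh
```

After the first walk of a deep path, a lookup of a sibling file costs
one round trip for the final component only, no matter how long the
path is.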

> And this is not a specific issue to my current employer, I've seen
> this in customer installations when I was at SUN (including long
> debates about Solaris's 1024 byte limit) and RedHat too.

POSIX based filesystems have hard limits on path
length in number of bytes. That's not going to
change just because these file systems are
exported via NFS.

> [1]=Windows opened the next can of pandora with removing the MAXPATH
> limit a while ago, e.g. see
> https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=registry
> - and even before that there was the "\\?\" prefix.
> 
> [snip]
>>> ca_maxoperations:
>>>    The maximum number of operations the replier will accept
>>>    in a COMPOUND or CB_COMPOUND. For the backchannel, the
>>>    server MUST NOT change the value the client offers. For
>>>    the fore channel, the server MAY change the requested
>>>    value. After the session is created, if a requester sends
>>>    a COMPOUND or CB_COMPOUND with more operations than
>>>    ca_maxoperations, the replier MUST return
>>>    NFS4ERR_TOO_MANY_OPS.
>> 
>> The BCP 14 "MAY" here means that servers can return the same
>> value, but clients have to expect that a server might return
>> something different.
>> 
>> Further, the spec does not permit an NFS server to respond to
>> a COMPOUND with more than the client's ca_maxoperations in
>> any way other than to return NFS4ERR_TOO_MANY_OPS. So it
>> cannot return a larger ca_maxoperations than the client sent.
>> 
>> NFSD returns the minimum of the client's max-ops and its own
>> NFSD_MAX_OPS_PER_COMPOUND value, which is 50. Thus NFSD will
>> return the same value as the client, unless the client asks
>> for more than 50.
> 
> I finally (yay - Saturday) had a look at this issue and
> collected&&processed statistics.
> With a Linux 6.6.20-rt25 kernel nfsd I get this in the ms-nfs41-client:
> ---- snip ----
> 1010: requested: req.csa_fore_chan_attrs.(ca_maxoperations=16384,
> ca_maxrequests=128)
> 1010: response:  session->fore_chan_attrs->(ca_maxoperations=50,
> ca_maxrequests=66)
> ---- snip ----
> 
> So - if I understand it correctly - the negotiation works correctly,
> and we get |ca_maxoperations=50| and |ca_maxrequests=66|.
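
The negotiation described above amounts to taking a minimum. The
Python sketch below shows that rule and what it means for splitting;
it is an illustration, not NFSD code. The value 50 is the one quoted
in this thread, and overhead_ops is an assumed parameter for the
operations each COMPOUND spends on things other than the payload
(SEQUENCE, PUTFH, and so on).

```python
NFSD_MAX_OPS_PER_COMPOUND = 50  # value quoted in this thread


def negotiate_maxops(requested: int,
                     server_limit: int = NFSD_MAX_OPS_PER_COMPOUND) -> int:
    """Server replies with the smaller of the client's request and its own cap."""
    return min(requested, server_limit)


def compounds_needed(payload_ops: int, maxops: int, overhead_ops: int = 0) -> int:
    """Number of COMPOUNDs needed to carry payload_ops operations.

    overhead_ops is a hypothetical per-COMPOUND cost that does not
    carry payload.
    """
    per_compound = maxops - overhead_ops
    if per_compound <= 0:
        raise ValueError("maxops leaves no room for payload operations")
    return -(-payload_ops // per_compound)  # ceiling division
```

With the values from the log, negotiate_maxops(16384) gives 50.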

> But... this value is too small, at least for what we do on Windows.
> I've collected samples (84 machines, a wide range of users, MS Office,
> ERP, CAD, etc.) and 71% of all server lookup calls had to be split
> (Linux 6.6 LTS kernel nfsd) for |ca_maxoperations==50|, 39% for
> |ca_maxoperations==64| and <1% for |ca_maxoperations==80|.

I can't imagine 80 being sufficient for more than
a year or two, given the other things you've
mentioned in this thread.

Have you considered adding a local NFS caching
server between your local Windows clients and
the network-distant NFS servers where the data
is stored?

> Question is... should the values for |ca_*| be a tuneable, or just
> increase the limit to |80| ([1]) ?

A server tunable will never completely address
this issue, and everyone will ask what's the
right value for this tunable? Where's the
documentation? Why can't I have another tunable
just for my favorite issue? So for me, yet
another server tunable is off the table.

Jeff suggested a plan to remove the max-operations
limit, and rely on ca_maxrequestsize instead,
which is a more solid limit though it would allow
more operations per COMPOUND.

But it sounds like you'll hit that limit too
rather quickly until your client caches lookups
properly.

TL;DR: relying on the ability to resolve a full
pathname in a single NFSv4 COMPOUND is a mistaken
and limited design and is already biting you. You
should address this root cause instead of
plastering over the real problem.

Yes, COMPOUND was added to NFSv4 as a possible
way to manage network latency, but in hindsight
I think the NFS community now recognizes that
there are more effective strategies to deal with
network latency than creating more and more
complicated COMPOUND operations. Client-side
caching, for instance, is a much better choice.

--
Chuck Lever
