On Thu, 6 Feb 2025 at 15:25, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>
> On 2/6/25 3:45 AM, Cedric Blancher wrote:
> > On Wed, 29 Jan 2025 at 16:02, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
> >>
> >> On 1/29/25 2:32 AM, Cedric Blancher wrote:
> >>> On Wed, 22 Jan 2025 at 11:07, Cedric Blancher <cedric.blancher@xxxxxxxxx> wrote:
> >>>>
> >>>> Good morning!
> >>>>
> >>>> IMO it might be good to increase RPCSVC_MAXPAYLOAD to at least 8M,
> >>>> giving the NFSv4.1 session mechanism some headroom for negotiation.
> >>>> For over a decade the default value is 1M (1*1024*1024u), which IMO
> >>>> causes problems with anything faster than 2500baseT.
> >>>
> >>> The 1MB limit was defined when 10base5/10baseT was the norm, and
> >>> 100baseT (100mbit) was "fast".
> >>>
> >>> Nowadays 1000baseT is the norm, 2500baseT is in premium *laptops*, and
> >>> 10000baseT is fast.
> >>> Just the 1MB limit is now in the way of EVERYTHING, including "large
> >>> send offload" and other acceleration features.
> >>>
> >>> So my suggestion is to increase the buffer to 4MB by default (2*2MB
> >>> hugepages on x86), and allow a tuneable to select up to 16MB.
> >>
> >> TL;DR: This has been on the long-term to-do list for NFSD for quite some
> >> time.
> >>
> >> We certainly want to support larger COMPOUNDs, but increasing
> >> RPCSVC_MAXPAYLOAD is only the first step.
> >>
> >> The biggest obstacle is the rq_pages[] array in struct svc_rqst. Today
> >> it has 259 entries. Quadrupling that would make the array itself
> >> multiple pages in size, and there's one of these for each nfsd thread.
> >>
> >> We are working on replacing the use of page arrays with folios, which
> >> would make this infrastructure significantly smaller and faster, but it
> >> depends on folio support in all the kernel APIs that NFSD makes use of.
> >> That situation continues to evolve.
> >>
> >> An equivalent issue exists in the Linux NFS client.
> >>
> >> Increasing this capability on the server without having a client that
> >> can make use of it doesn't seem wise.
> >>
> >> You can try increasing the value of RPCSVC_MAXPAYLOAD yourself and try
> >> some measurements to help make the case (and analyze the operational
> >> costs). I think we need some confidence that increasing the maximum
> >> payload size will not unduly impact small I/O.
> >>
> >> Re: a tunable: I'm not sure why someone would want to tune this number
> >> down from the maximum. You can control how much total memory the server
> >> consumes by reducing the number of nfsd threads.
> >>
> >
> > I want a tuneable for TESTING, i.e. lower default (for now), but allow
> > people to grab a stock Linux kernel, increase tunable, and do testing.
> > Not everyone is happy with doing the voodoo of self-build testing,
> > even more so in the (dark) "Age Of SecureBoot", where a signed kernel
> > is mandatory. Therefore: Tuneable.
>
> That's appropriate for experimentation, but not a good long-term
> solution that should go into the upstream source code.

I disagree. In the age of "SecureBoot enforcement", where only
cryptographically signed kernels can be loaded on servers, how else
should that data be collected?

>
> A tuneable in the upstream source base means the upstream community and
> distributors have to support it for a very long time, and these are hard
> to get rid of once they become irrelevant.

No, this tunable is very likely to stay: it defines the DEFAULT for the
kernel.

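To make that concrete, a knob like this would be tiny. The sketch below
is purely hypothetical: neither this module nor an "nfsd_max_payload"
parameter exists in the kernel today, and the 1MB..16MB clamp simply
mirrors the numbers already discussed in this thread. It only shows how
little code a clamped, read-only module parameter needs:

// SPDX-License-Identifier: GPL-2.0
/*
 * HYPOTHETICAL sketch -- this module and the "nfsd_max_payload"
 * parameter do not exist in the kernel.  It only demonstrates how
 * small a clamped, read-only tunable is: one module_param() plus a
 * clamp() at the single place the server would read it.
 */
#include <linux/init.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/minmax.h>
#include <linux/printk.h>

#define PAYLOAD_FLOOR	(1u * 1024 * 1024)	/* today's RPCSVC_MAXPAYLOAD */
#define PAYLOAD_CEILING	(16u * 1024 * 1024)	/* upper bound discussed here */

/* e.g. modprobe payload_knob nfsd_max_payload=4194304 */
static unsigned int nfsd_max_payload = 4u * 1024 * 1024;
module_param(nfsd_max_payload, uint, 0444);
MODULE_PARM_DESC(nfsd_max_payload,
		 "Maximum RPC payload in bytes, clamped to 1MB..16MB");

static int __init payload_knob_init(void)
{
	/* A real server would read the clamped value once at startup. */
	pr_info("payload_knob: effective max payload is %u bytes\n",
		clamp(nfsd_max_payload, PAYLOAD_FLOOR, PAYLOAD_CEILING));
	return 0;
}

static void __exit payload_knob_exit(void)
{
}

module_init(payload_knob_init);
module_exit(payload_knob_exit);
MODULE_DESCRIPTION("Sketch of a clamped max-payload tunable");
MODULE_LICENSE("GPL");

If something like that were wired into the server's buffer sizing,
testing on a distro's signed kernel would just mean setting the
parameter at load time, instead of rebuilding and re-signing a kernel
for every value under test.
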
>
> We have to provide documentation. That documentation might contain
> recommended values, and those change over time. They spread out over
> the internet and the stale recommended values become a liability.
>
> Admins and users frequently set tuneables incorrectly and that results
> in bugs and support calls.
>
> It increases the size of test matrices.
>
> Adding only one of these might not result in a significant increase in
> maintenance cost, but if we allow one tuneable, then we have to allow
> all of them, and that becomes a living nightmare.

That never was a problem for any of the UNIX System V derivatives, which
all load kernel tunables from /etc/system. No one ever complained, and
Linux already has the same concept with sysctl.

>
> So, not as simple and low-cost as you might think to just "add a
> tuneable" in upstream. And not a sensible choice when all you need is a
> temporary adjustment for testing.
>
> Do you have a reason why, after we agree on an increase, this should
> be a setting that admins will need to lower the value from a default of,
> say, 4MB or more? If so, then it makes sense to consider a tuneable (or
> better, a self-tuning mechanism). For a temporary setting for the
> purpose of experimentation, writing your own patch is the better and
> less costly approach.

Testing, profiling, and performance measurements. And a 4MB default
might be a problem for embedded machines with only 16MB of RAM; a
back-of-the-envelope calculation is appended after my signature. So
yes, I think Linux either needs a tunable, or should just GIVE UP on a
bigger TCP buffer size. People can always use RDMA or other platforms
if they want decent transport performance.

Ced

-- 
Cedric Blancher <cedric.blancher@xxxxxxxxx>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur
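
Appendix: the back-of-the-envelope arithmetic behind the 16MB remark
and the rq_pages[] growth quoted above. The page-count formula
(payload/PAGE_SIZE plus a few extra pages for the RPC header, which is
how the quoted 259 appears to be derived), the 8-byte page pointer and
the 8-thread example are assumptions for illustration, not
measurements:

/*
 * Back-of-the-envelope numbers only.  Assumes 4KB pages, 8-byte page
 * pointers, and that the rq_pages[] entry count is roughly
 * payload/PAGE_SIZE plus a handful of extra pages for the RPC header
 * (which is how the 259 quoted above appears to be derived).  The
 * thread count of 8 is just an example, not a recommendation.
 */
#include <stdio.h>

int main(void)
{
	const unsigned long page_size = 4096, ptr_size = 8, extra_pages = 3;
	const unsigned long payloads[] = { 1u << 20, 4u << 20, 16u << 20 };
	const unsigned long nthreads = 8;

	for (unsigned i = 0; i < sizeof(payloads) / sizeof(payloads[0]); i++) {
		unsigned long pages = payloads[i] / page_size + extra_pages;
		unsigned long array_bytes = pages * ptr_size;

		printf("payload %2lu MB: rq_pages[%lu], array %lu bytes "
		       "(%lu page(s)), ~%lu MB of payload pages across %lu threads\n",
		       payloads[i] >> 20, pages, array_bytes,
		       (array_bytes + page_size - 1) / page_size,
		       (pages * page_size * nthreads) >> 20, nthreads);
	}
	return 0;
}

With those assumptions a 4MB maximum already means about 1027 page
pointers per thread (an array of just over 8KB, i.e. multiple pages)
and roughly 32MB of payload pages for 8 threads, which is more than
the entire RAM of a 16MB box.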