On Thu, 6 Feb 2025 at 15:25, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>
> On 2/6/25 3:45 AM, Cedric Blancher wrote:
> > On Wed, 29 Jan 2025 at 16:02, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
> >>
> >> On 1/29/25 2:32 AM, Cedric Blancher wrote:
> >>> On Wed, 22 Jan 2025 at 11:07, Cedric Blancher <cedric.blancher@xxxxxxxxx> wrote:
> >>>>
> >>>> Good morning!
> >>>>
> >>>> IMO it might be good to increase RPCSVC_MAXPAYLOAD to at least 8M,
> >>>> giving the NFSv4.1 session mechanism some headroom for negotiation.
> >>>> For over a decade the default value is 1M (1*1024*1024u), which IMO
> >>>> causes problems with anything faster than 2500baseT.
> >>>
> >>> The 1MB limit was defined when 10base5/10baseT was the norm, and
> >>> 100baseT (100mbit) was "fast".
> >>>
> >>> Nowadays 1000baseT is the norm, 2500baseT is in premium *laptops*, and
> >>> 10000baseT is fast.
> >>> Just the 1MB limit is now in the way of EVERYTHING, including "large
> >>> send offload" and other acceleration features.
> >>>
> >>> So my suggestion is to increase the buffer to 4MB by default (2*2MB
> >>> hugepages on x86), and allow a tuneable to select up to 16MB.
> >>
> >> TL;DR: This has been on the long-term to-do list for NFSD for quite some
> >> time.
> >>
> >> We certainly want to support larger COMPOUNDs, but increasing
> >> RPCSVC_MAXPAYLOAD is only the first step.
> >>
> >> The biggest obstacle is the rq_pages[] array in struct svc_rqst. Today
> >> it has 259 entries. Quadrupling that would make the array itself
> >> multiple pages in size, and there's one of these for each nfsd thread.
> >>
> >> We are working on replacing the use of page arrays with folios, which
> >> would make this infrastructure significantly smaller and faster, but it
> >> depends on folio support in all the kernel APIs that NFSD makes use of.
> >> That situation continues to evolve.
> >>
> >> An equivalent issue exists in the Linux NFS client.
> >>
> >> Increasing this capability on the server without having a client that
> >> can make use of it doesn't seem wise.
> >>
> >> You can try increasing the value of RPCSVC_MAXPAYLOAD yourself and try
> >> some measurements to help make the case (and analyze the operational
> >> costs). I think we need some confidence that increasing the maximum
> >> payload size will not unduly impact small I/O.
> >>
> >> Re: a tunable: I'm not sure why someone would want to tune this number
> >> down from the maximum. You can control how much total memory the server
> >> consumes by reducing the number of nfsd threads.
> >>
> >
> > I want a tuneable for TESTING, i.e. lower default (for now), but allow
> > people to grab a stock Linux kernel, increase tunable, and do testing.
> > Not everyone is happy with doing the voodoo of self-build testing,
> > even more so in the (dark) "Age Of SecureBoot", where a signed kernel
> > is mandatory. Therefore: Tuneable.
>
> That's appropriate for experimentation, but not a good long-term
> solution that should go into the upstream source code.

I disagree. In the age of "SecureBoot enforcement", where only
cryptographically signed kernels can be loaded on servers, how else
should that data be collected?

>
> A tuneable in the upstream source base means the upstream community and
> distributors have to support it for a very long time, and these are hard
> to get rid of once they become irrelevant.

No, this tunable is very likely to stay: it defines the DEFAULT for the
kernel.

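To make that concrete, a knob like this would be tiny. The sketch below
is purely hypothetical: neither this module nor an "nfsd_max_payload"
parameter exists in the kernel today, and the 1MB..16MB clamp simply
mirrors the numbers already discussed in this thread. It only shows how
little code a clamped, read-only module parameter needs:

// SPDX-License-Identifier: GPL-2.0
/*
 * HYPOTHETICAL sketch -- this module and the "nfsd_max_payload"
 * parameter do not exist in the kernel.  It only demonstrates how
 * small a clamped, read-only tunable is: one module_param() plus a
 * clamp() at the single place the server would read it.
 */
#include <linux/init.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/minmax.h>
#include <linux/printk.h>

#define PAYLOAD_FLOOR	(1u * 1024 * 1024)	/* today's RPCSVC_MAXPAYLOAD */
#define PAYLOAD_CEILING	(16u * 1024 * 1024)	/* upper bound discussed here */

/* e.g. modprobe payload_knob nfsd_max_payload=4194304 */
static unsigned int nfsd_max_payload = 4u * 1024 * 1024;
module_param(nfsd_max_payload, uint, 0444);
MODULE_PARM_DESC(nfsd_max_payload,
		 "Maximum RPC payload in bytes, clamped to 1MB..16MB");

static int __init payload_knob_init(void)
{
	/* A real server would read the clamped value once at startup. */
	pr_info("payload_knob: effective max payload is %u bytes\n",
		clamp(nfsd_max_payload, PAYLOAD_FLOOR, PAYLOAD_CEILING));
	return 0;
}

static void __exit payload_knob_exit(void)
{
}

module_init(payload_knob_init);
module_exit(payload_knob_exit);
MODULE_DESCRIPTION("Sketch of a clamped max-payload tunable");
MODULE_LICENSE("GPL");

If something like that were wired into the server's buffer sizing,
testing on a distro's signed kernel would just mean setting the
parameter at load time, instead of rebuilding and re-signing a kernel
for every value under test.
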
>
> We have to provide documentation. That documentation might contain
> recommended values, and those change over time. They spread out over
> the internet and the stale recommended values become a liability.
>
> Admins and users frequently set tuneables incorrectly and that results
> in bugs and support calls.
>
> It increases the size of test matrices.
>
> Adding only one of these might not result in a significant increase in
> maintenance cost, but if we allow one tuneable, then we have to allow
> all of them, and that becomes a living nightmare.

That never was a problem for any of the UNIX System V derivatives, which
all load kernel tunables from /etc/system. No one ever complained, and
Linux already has the same concept with sysctl.

>
> So, not as simple and low-cost as you might think to just "add a
> tuneable" in upstream. And not a sensible choice when all you need is a
> temporary adjustment for testing.
>
> Do you have a reason why, after we agree on an increase, this should
> be a setting that admins will need to lower the value from a default of,
> say, 4MB or more? If so, then it makes sense to consider a tuneable (or
> better, a self-tuning mechanism). For a temporary setting for the
> purpose of experimentation, writing your own patch is the better and
> less costly approach.

Testing, profiling, and performance measurements. And a 4MB default
might be a problem for embedded machines with only 16MB of RAM; a
back-of-the-envelope calculation is appended after my signature. So
yes, I think Linux either needs a tunable, or should just GIVE UP on a
bigger TCP buffer size. People can always use RDMA or other platforms
if they want decent transport performance.

Ced

-- 
Cedric Blancher <cedric.blancher@xxxxxxxxx>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur
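
Appendix: the back-of-the-envelope arithmetic behind the 16MB remark
and the rq_pages[] growth quoted above. The page-count formula
(payload/PAGE_SIZE plus a few extra pages for the RPC header, which is
how the quoted 259 appears to be derived), the 8-byte page pointer and
the 8-thread example are assumptions for illustration, not
measurements:

/*
 * Back-of-the-envelope numbers only.  Assumes 4KB pages, 8-byte page
 * pointers, and that the rq_pages[] entry count is roughly
 * payload/PAGE_SIZE plus a handful of extra pages for the RPC header
 * (which is how the 259 quoted above appears to be derived).  The
 * thread count of 8 is just an example, not a recommendation.
 */
#include <stdio.h>

int main(void)
{
	const unsigned long page_size = 4096, ptr_size = 8, extra_pages = 3;
	const unsigned long payloads[] = { 1u << 20, 4u << 20, 16u << 20 };
	const unsigned long nthreads = 8;

	for (unsigned i = 0; i < sizeof(payloads) / sizeof(payloads[0]); i++) {
		unsigned long pages = payloads[i] / page_size + extra_pages;
		unsigned long array_bytes = pages * ptr_size;

		printf("payload %2lu MB: rq_pages[%lu], array %lu bytes "
		       "(%lu page(s)), ~%lu MB of payload pages across %lu threads\n",
		       payloads[i] >> 20, pages, array_bytes,
		       (array_bytes + page_size - 1) / page_size,
		       (pages * page_size * nthreads) >> 20, nthreads);
	}
	return 0;
}

With those assumptions a 4MB maximum already means about 1027 page
pointers per thread (an array of just over 8KB, i.e. multiple pages)
and roughly 32MB of payload pages for 8 threads, which is more than
the entire RAM of a 16MB box.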