Re: Increase RPCSVC_MAXPAYLOAD to 8M?

On 2/6/25 3:45 AM, Cedric Blancher wrote:
> On Wed, 29 Jan 2025 at 16:02, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>
>> On 1/29/25 2:32 AM, Cedric Blancher wrote:
>>> On Wed, 22 Jan 2025 at 11:07, Cedric Blancher <cedric.blancher@xxxxxxxxx> wrote:
>>>>
>>>> Good morning!
>>>>
>>>> IMO it might be good to increase RPCSVC_MAXPAYLOAD to at least 8M,
>>>> giving the NFSv4.1 session mechanism some headroom for negotiation.
>>>> For over a decade the default value has been 1M (1*1024*1024u),
>>>> which IMO causes problems with anything faster than 2500baseT.
>>>
>>> The 1MB limit was defined when 10base5/10baseT was the norm, and
>>> 100baseT (100mbit) was "fast".
>>>
>>> Nowadays 1000baseT is the norm, 2500baseT is in premium *laptops*, and
>>> 10000baseT is fast.
>>> The 1MB limit now gets in the way of EVERYTHING, including "large
>>> send offload" and other acceleration features.
>>>
>>> So my suggestion is to increase the buffer to 4MB by default (2*2MB
>>> hugepages on x86), and allow a tuneable to select up to 16MB.
>>
>> TL;DR: This has been on the long-term to-do list for NFSD for quite some
>> time.
>>
>> We certainly want to support larger COMPOUNDs, but increasing
>> RPCSVC_MAXPAYLOAD is only the first step.
>>
>> The biggest obstacle is the rq_pages[] array in struct svc_rqst. Today
>> it has 259 entries. Quadrupling that would make the array itself
>> multiple pages in size, and there's one of these for each nfsd thread.
>>
>> We are working on replacing the use of page arrays with folios, which
>> would make this infrastructure significantly smaller and faster, but it
>> depends on folio support in all the kernel APIs that NFSD makes use of.
>> That situation continues to evolve.
>>
>> An equivalent issue exists in the Linux NFS client.
>>
>> Increasing this capability on the server without having a client that
>> can make use of it doesn't seem wise.
>>
>> You can try increasing the value of RPCSVC_MAXPAYLOAD yourself and
>> run some measurements to help make the case (and analyze the
>> operational costs). I think we need some confidence that increasing
>> the maximum payload size will not unduly impact small I/O.
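For reference, the experiment itself amounts to a one-line change, roughly
the following (file path and current definition as found in recent kernels;
treat this as a sketch, not a reviewed patch):

```diff
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
-#define RPCSVC_MAXPAYLOAD	(1*1024*1024u)
+#define RPCSVC_MAXPAYLOAD	(4*1024*1024u)
```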
>>
>> Re: a tunable: I'm not sure why someone would want to tune this number
>> down from the maximum. You can control how much total memory the server
>> consumes by reducing the number of nfsd threads.
>>
> 
> I want a tunable for TESTING, i.e. keep the lower default (for now),
> but allow people to grab a stock Linux kernel, raise the tunable, and
> do testing. Not everyone is happy with the voodoo of building their
> own kernel, even more so in the (dark) "Age Of SecureBoot", where a
> signed kernel is mandatory. Therefore: a tunable.

That's appropriate for experimentation, but it's not a good long-term
solution and should not go into the upstream source code.

A tunable in the upstream source base must be supported by the upstream
community and distributors for a very long time, and tunables are hard
to remove once they become irrelevant.

We have to provide documentation. That documentation might contain
recommended values, and those change over time. The old recommendations
spread across the internet, and the stale values become a liability.

Admins and users frequently set tunables incorrectly, which results
in bugs and support calls.

It increases the size of test matrices.

Adding only one of these might not result in a significant increase in
maintenance cost, but if we allow one tuneable, then we have to allow
all of them, and that becomes a living nightmare.

So "just adding a tunable" upstream is not as simple and low-cost as
you might think, and it's not a sensible choice when all you need is a
temporary adjustment for testing.

Do you have a reason why, after we agree on an increase, admins would
need to lower this value from a default of, say, 4MB or more? If so,
then it makes sense to consider a tunable (or better, a self-tuning
mechanism). For a temporary setting for the purpose of
experimentation, carrying your own patch is the better and less costly
approach.


-- 
Chuck Lever



