> On Feb 18, 2022, at 4:26 PM, Tom Talpey <tom@xxxxxxxxxx> wrote:
>
>
> On 2/18/2022 2:04 PM, Daire Byrne wrote:
>> On Wed, 9 Feb 2022 at 17:38, Tom Talpey <tom@xxxxxxxxxx> wrote:
>>>
>>> On 2/7/2022 1:57 PM, Daire Byrne wrote:
>>>> Hi,
>>>>
>>>> As part of my ongoing investigations into high latency WAN NFS
>>>> performance with only a single client (for the purposes of then
>>>> re-exporting), I have been looking at the metadata performance
>>>> differences between NFSv3 and NFSv4.2.
>>>>
>>>> High latency seems to be a particularly good way of highlighting the
>>>> parallel/concurrency performance limitations of a single NFS client.
>>>> So I took a client 200ms away from the server and ran things like
>>>> open() and stat() calls to many files & directories using simultaneous
>>>> threads (200+) to see how many requests and operations we could keep
>>>> in flight simultaneously.
>>>>
>>>> The executive summary is that NFSv4 is around 10x worse than NFSv3, and
>>>> an NFSv4 client clearly flatlines at around 180 ops/s with 200ms. By
>>>> comparison, an NFSv3 client can do around 1,500 ops/s (access+lookup)
>>>> with the same test.
>>>>
>>>> On paper, NFSv4 is more compelling over the WAN as it should reduce
>>>> round trips with things like compound operations and delegations, but
>>>> that's only good if it can do lots of them simultaneously too.
>>>>
>>>> Comparing the slot table/xprt stats between the two protocols while
>>>> running the benchmark highlights the difference:
>>>>
>>>> NFSv3
>>>> opts: rw,vers=3,rsize=1048576,wsize=1048576,namlen=255,acregmin=3600,acregmax=3600,acdirmin=3600,acdirmax=3600,hard,nocto,noresvport,proto=tcp,nconnect=4,timeo=600,retrans=10,sec=sys,mountaddr=10.25.22.17,mountvers=3,mountport=20048,mountproto=udp,fsc,local_lock=none
>>>> xprt: tcp 0 1 2 0 0 85480 85380 0 6549783 0 102 166291 6296122
>>>> xprt: tcp 0 1 2 0 0 85827 85727 0 6575842 0 102 149914 6322130
>>>> xprt: tcp 0 1 2 0 0 85674 85574 0 6577487 0 102 131288 6320278
>>>> xprt: tcp 0 1 2 0 0 84943 84843 0 6505613 0 102 182313 6251396
>>>>
>>>> NFSv4.2
>>>> opts: rw,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=3600,acregmax=3600,acdirmin=3600,acdirmax=3600,hard,nocto,noresvport,proto=tcp,nconnect=4,timeo=600,retrans=10,sec=sys,clientaddr=10.25.112.8,fsc,local_lock=none
>>>> xprt: tcp 0 0 2 0 0 301 301 0 1439 0 9 80 1058
>>>> xprt: tcp 0 0 2 0 0 294 294 0 1452 0 10 79 1085
>>>> xprt: tcp 0 0 2 0 0 292 292 0 1443 0 10 102 1055
>>>> xprt: tcp 0 0 2 0 0 287 286 0 1407 0 9 64 1067
>>>>
>>>> So either we aren't putting things into the slot table quickly enough
>>>> for it to scale up, or it just isn't scaling for some other reason.
>>>>
>>>> The max slots of 101 for NFSv3 and 10 for NFSv4.2 probably account
>>>> for the aggregate difference of 10x I see in benchmarking?
>>>>
>>>> I tried increasing /sys/module/nfs/parameters/max_session_slots
>>>> from 64 to 128 on the client (modprobe.conf & reboot) but it didn't
>>>> seem to make much difference. Maybe it's a server side limit then and
>>>> the lowest is being used:
>>>>
>>>> fs/nfsd/state.h:
>>>> #define NFSD_SLOT_CACHE_SIZE 2048
>>>> /* Maximum number of NFSD_SLOT_CACHE_SIZE slots per session */
>>>> #define NFSD_CACHE_SIZE_SLOTS_PER_SESSION 32
>>>>
>>>> I'm sure there are probably good reasons for these values (like
>>>> stopping a client from hogging the queue) but is this the reason I see
>>>> such a big difference in the performance of concurrency for a single
>>>> client over high latencies?
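For reference, the kind of harness Daire describes could look something
like the minimal pthreads sketch below. It is illustrative only: the
thread count, run time, mount path, and file layout are assumptions, not
his actual test tool, and it issues only stat() calls.

/*
 * Illustrative only: N threads hammering stat() against an NFS mount
 * and counting completions, in the spirit of the test described above.
 * Build with: cc -O2 -pthread -o statload statload.c
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

#define NTHREADS 200   /* "simultaneous threads (200+)" */
#define RUNTIME  30    /* seconds */

static volatile int stop;
static long counts[NTHREADS];

static void *worker(void *arg)
{
        long id = (long)arg;
        char path[256];
        struct stat st;

        while (!stop) {
                /* many distinct names so each call is a fresh lookup */
                snprintf(path, sizeof(path), "/mnt/nfs/dir%ld/file%ld",
                         id, counts[id] % 10000);
                (void)stat(path, &st);   /* errors ignored; we count calls */
                counts[id]++;
        }
        return NULL;
}

int main(void)
{
        pthread_t tids[NTHREADS];
        long total = 0;

        for (long i = 0; i < NTHREADS; i++)
                pthread_create(&tids[i], NULL, worker, (void *)i);
        sleep(RUNTIME);
        stop = 1;
        for (long i = 0; i < NTHREADS; i++) {
                pthread_join(tids[i], NULL);
                total += counts[i];
        }
        printf("%ld stat() calls in %d s => ~%ld ops/s aggregate\n",
               total, RUNTIME, total / RUNTIME);
        return 0;
}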
>>>
>>> Daire, I'm interested in your results if you increase the server slot
>>> limits. Remember that the "slot" is an NFSv4.1+ protocol element. In
>>> NFSv3 and v4.0, there is no protocol-based flow control, so the max
>>> outstanding RPC counts are effectively the smaller of the client's and
>>> server's RPC task and/or thread limits, and of course the wire itself.
>>>
>>> With a 200msec RTT and a single-threaded workload, you'll get 5 ops/sec,
>>> times 32 slots that's pretty much the 180 you see. So I'd expect it to
>>> rise linearly as you scale both ends' slot numbers.

>> I finally got around to testing this again. I recompiled a server kernel with:
>> NFSD_CACHE_SIZE_SLOTS_PER_SESSION=256
>> I ran some more tests and as predicted this helps a lot. Because the
>> client's default max_session_slots is 64 (where the server's is 32),
>> I saw double the concurrency straightaway.
>
> Nice, thanks for the followup!
>
>> And then as I increased the client's max_session_slots (up to 256) it
>> kept on improving. I guess I would need to set the server and client
>> slots to be around 512 to see the same concurrency performance as for
>> NFSv3 with 200ms.
>> Which I guess leads on to some questions:
>> 1) Why is NFSD_CACHE_SIZE_SLOTS_PER_SESSION not a tunable? We don't
>> really want to maintain our own kernel compiles on our RHEL8 servers.
>
> I totally agree that it's reasonable to allow tuning. And, 32 is a
> woefully small maximum.

As denizens of this community know, I don't relish adding tuning knobs
when the setting can be abused or set improperly. You'll have to
convince me that we can't construct a reasonable and safe internal
heuristic that determines a good default slot count value.

(Meaning: adjustable is OK, but I'd prefer it to be a dynamic and
automated setting, not one that needs to be set via an administrative
interface.)

>> 2) Why is the default Linux client slot count 64 and the server's 32?
>> You can tune the Linux client down but not up (if using a Linux
>> server).
>
> That's for Trond and Chuck I guess.

For the Linux NFS server, there is an enhancement request open in this
area:

  https://bugzilla.linux-nfs.org/show_bug.cgi?id=375

If there are any relevant design notes or performance results, that
would be the place to put them.

IIRC the only downside to a large default slot count on the server is
that it can waste memory, and it is difficult to handle the corner
cases when the server is running on a small physical host (or in a
small container).

>> 3) What would be the recommended and safest way to have a few high
>> latency clients with increased slots and concurrency?
>
> So, slot counts are negotiable, and dynamic, between client and
> server in NFSv4.1+. But I don't believe that either the Linux client
> or server allow them to change after starting a session.
>
> IMO the best way is to write some code to manage slots both to increase
> on demand and decrease on non-use. But dynamic credit management is a
> devilishly hard thing to get right. It won't be trivial.
>
>> I'm thinking it would be better to have the server default be higher
>> and the Linux client default be 32 instead to replicate the current
>> situation. But no doubt there are other storage filers that already
>> rely on the fact that the Linux client uses 64 (e.g. cloud Netapps and
>> the like).
>
> If that's true, it'd be a shame. The protocol allows any value. No
> constant number will ever be "best", or even correct.
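To make Tom's back-of-the-envelope above concrete, here is a trivial
sketch of the slot-limited ceiling at a 200 ms RTT, assuming at most one
synchronous request outstanding per slot. The slot counts are just the
values that have come up in this thread.

/*
 * Back-of-the-envelope only: each slot completes ~1/RTT = 5 ops/s at a
 * 200 ms RTT, so the aggregate ceiling is roughly slots / RTT.
 */
#include <stdio.h>

int main(void)
{
        const double rtt = 0.200;                        /* seconds */
        const int slots[] = { 10, 32, 64, 128, 256, 512 };

        for (size_t i = 0; i < sizeof(slots) / sizeof(slots[0]); i++)
                printf("%3d slots -> ~%4.0f ops/s\n",
                       slots[i], slots[i] / rtt);

        /* 32 slots -> ~160 ops/s, close to the ~180 flatline Daire saw.
         * Matching NFSv3's ~1,500 ops/s needs roughly 300 requests in
         * flight, which is why ~512 slots at both ends looks about right.
         */
        return 0;
}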
>
>> It's probably just a lot less hassle to stick with NFSv3 for this kind
>> of high latency multi process concurrency use case.
>
> That, too, would be a shame. It's worth the effort to find a better
> NFSv4.1 Linux solution.
>
> Tom.

--
Chuck Lever