> On Feb 18, 2022, at 4:26 PM, Tom Talpey <tom@xxxxxxxxxx> wrote:
>
>
> On 2/18/2022 2:04 PM, Daire Byrne wrote:
>> On Wed, 9 Feb 2022 at 17:38, Tom Talpey <tom@xxxxxxxxxx> wrote:
>>>
>>> On 2/7/2022 1:57 PM, Daire Byrne wrote:
>>>> Hi,
>>>>
>>>> As part of my ongoing investigations into high latency WAN NFS
>>>> performance with only a single client (for the purposes of then
>>>> re-exporting), I have been looking at the metadata performance
>>>> differences between NFSv3 and NFSv4.2.
>>>>
>>>> High latency seems to be a particularly good way of highlighting the
>>>> parallel/concurrency performance limitations of a single NFS client.
>>>> So I took a client 200ms away from the server and ran things like
>>>> open() and stat() calls to many files & directories using simultaneous
>>>> threads (200+) to see how many requests and operations we could keep
>>>> in flight simultaneously.
>>>>
>>>> The executive summary is that NFSv4 is around 10x worse than NFSv3, and
>>>> an NFSv4 client clearly flatlines at around 180 ops/s with 200ms. By
>>>> comparison, an NFSv3 client can do around 1,500 ops/s (access+lookup)
>>>> with the same test.
>>>>
>>>> On paper, NFSv4 is more compelling over the WAN as it should reduce
>>>> round trips with things like compound operations and delegations, but
>>>> that's only good if it can do lots of them simultaneously too.
>>>>
>>>> Comparing the slot table/xprt stats between the two protocols while
>>>> running the benchmark highlights the difference:
>>>>
>>>> NFSv3
>>>> opts: rw,vers=3,rsize=1048576,wsize=1048576,namlen=255,acregmin=3600,acregmax=3600,acdirmin=3600,acdirmax=3600,hard,nocto,noresvport,proto=tcp,nconnect=4,timeo=600,retrans=10,sec=sys,mountaddr=10.25.22.17,mountvers=3,mountport=20048,mountproto=udp,fsc,local_lock=none
>>>> xprt: tcp 0 1 2 0 0 85480 85380 0 6549783 0 102 166291 6296122
>>>> xprt: tcp 0 1 2 0 0 85827 85727 0 6575842 0 102 149914 6322130
>>>> xprt: tcp 0 1 2 0 0 85674 85574 0 6577487 0 102 131288 6320278
>>>> xprt: tcp 0 1 2 0 0 84943 84843 0 6505613 0 102 182313 6251396
>>>>
>>>> NFSv4.2
>>>> opts: rw,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=3600,acregmax=3600,acdirmin=3600,acdirmax=3600,hard,nocto,noresvport,proto=tcp,nconnect=4,timeo=600,retrans=10,sec=sys,clientaddr=10.25.112.8,fsc,local_lock=none
>>>> xprt: tcp 0 0 2 0 0 301 301 0 1439 0 9 80 1058
>>>> xprt: tcp 0 0 2 0 0 294 294 0 1452 0 10 79 1085
>>>> xprt: tcp 0 0 2 0 0 292 292 0 1443 0 10 102 1055
>>>> xprt: tcp 0 0 2 0 0 287 286 0 1407 0 9 64 1067
>>>>
>>>> So either we aren't putting things into the slot table quickly enough
>>>> for it to scale up, or it just isn't scaling for some other reason.
>>>>
>>>> The max slots of 101 for NFSv3 and 10 for NFSv4.2 probably account
>>>> for the aggregate difference of 10x I see in benchmarking?
>>>>
>>>> I tried increasing /sys/module/nfs/parameters/max_session_slots
>>>> from 64 to 128 on the client (modprobe.conf & reboot) but it didn't
>>>> seem to make much difference. Maybe it's a server side limit then and
>>>> the lowest is being used:
>>>>
>>>> fs/nfsd/state.h:
>>>> #define NFSD_SLOT_CACHE_SIZE 2048
>>>> /* Maximum number of NFSD_SLOT_CACHE_SIZE slots per session */
>>>> #define NFSD_CACHE_SIZE_SLOTS_PER_SESSION 32
>>>>
>>>> I'm sure there are probably good reasons for these values (like
>>>> stopping a client from hogging the queue) but is this the reason I see
>>>> such a big difference in the performance of concurrency for a single
>>>> client over high latencies?
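For reference, the kind of harness Daire describes could look something
like the minimal pthreads sketch below. It is illustrative only: the
thread count, run time, mount path, and file layout are assumptions, not
his actual test tool, and it issues only stat() calls.

/*
 * Illustrative only: N threads hammering stat() against an NFS mount
 * and counting completions, in the spirit of the test described above.
 * Build with: cc -O2 -pthread -o statload statload.c
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

#define NTHREADS 200   /* "simultaneous threads (200+)" */
#define RUNTIME  30    /* seconds */

static volatile int stop;
static long counts[NTHREADS];

static void *worker(void *arg)
{
        long id = (long)arg;
        char path[256];
        struct stat st;

        while (!stop) {
                /* many distinct names so each call is a fresh lookup */
                snprintf(path, sizeof(path), "/mnt/nfs/dir%ld/file%ld",
                         id, counts[id] % 10000);
                (void)stat(path, &st);   /* errors ignored; we count calls */
                counts[id]++;
        }
        return NULL;
}

int main(void)
{
        pthread_t tids[NTHREADS];
        long total = 0;

        for (long i = 0; i < NTHREADS; i++)
                pthread_create(&tids[i], NULL, worker, (void *)i);
        sleep(RUNTIME);
        stop = 1;
        for (long i = 0; i < NTHREADS; i++) {
                pthread_join(tids[i], NULL);
                total += counts[i];
        }
        printf("%ld stat() calls in %d s => ~%ld ops/s aggregate\n",
               total, RUNTIME, total / RUNTIME);
        return 0;
}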
>>>
>>> Daire, I'm interested in your results if you increase the server slot
>>> limits. Remember that the "slot" is an NFSv4.1+ protocol element. In
>>> NFSv3 and v4.0, there is no protocol-based flow control, so the max
>>> outstanding RPC counts are effectively the smaller of the client's and
>>> server's RPC task and/or thread limits, and of course the wire itself.
>>>
>>> With a 200msec RTT and a single-threaded workload, you'll get 5 ops/sec,
>>> times 32 slots that's pretty much the 180 you see. So I'd expect it to
>>> rise linearly as you scale both ends' slot numbers.

>> I finally got around to testing this again. I recompiled a server kernel with:
>> NFSD_CACHE_SIZE_SLOTS_PER_SESSION=256
>> I ran some more tests and as predicted this helps a lot. Because the
>> client's default max_session_slots is 64 (where the server's is 32),
>> I saw double the concurrency straightaway.
>
> Nice, thanks for the followup!
>
>> And then as I increased the client's max_session_slots (up to 256) it
>> kept on improving. I guess I would need to set the server and client
>> slots to be around 512 to see the same concurrency performance as for
>> NFSv3 with 200ms.
>> Which I guess leads on to some questions:
>> 1) Why is NFSD_CACHE_SIZE_SLOTS_PER_SESSION not a tunable? We don't
>> really want to maintain our own kernel compiles on our RHEL8 servers.
>
> I totally agree that it's reasonable to allow tuning. And, 32 is a
> woefully small maximum.

As denizens of this community know, I don't relish adding tuning knobs
when the setting can be abused or set improperly. You'll have to
convince me that we can't construct a reasonable and safe internal
heuristic that determines a good default slot count value.

(Meaning: adjustable is OK, but I'd prefer it to be a dynamic and
automated setting, not one that needs to be set via an administrative
interface.)

>> 2) Why is the default Linux client slot count 64 and the server's 32?
>> You can tune the Linux client down but not up (if using a Linux
>> server).
>
> That's for Trond and Chuck I guess.

For the Linux NFS server, there is an enhancement request open in this
area:

  https://bugzilla.linux-nfs.org/show_bug.cgi?id=375

If there are any relevant design notes or performance results, that
would be the place to put them.

IIRC the only downside to a large default slot count on the server is
that it can waste memory, and it is difficult to handle the corner
cases when the server is running on a small physical host (or in a
small container).

>> 3) What would be the recommended and safest way to have a few high
>> latency clients with increased slots and concurrency?
>
> So, slot counts are negotiable, and dynamic, between client and
> server in NFSv4.1+. But I don't believe that either the Linux client
> or server allow them to change after starting a session.
>
> IMO the best way is to write some code to manage slots both to increase
> on demand and decrease on non-use. But dynamic credit management is a
> devilishly hard thing to get right. It won't be trivial.
>
>> I'm thinking it would be better to have the server default be higher
>> and the Linux client default be 32 instead to replicate the current
>> situation. But no doubt there are other storage filers that already
>> rely on the fact that the Linux client uses 64 (e.g. cloud Netapps and
>> the like).
>
> If that's true, it'd be a shame. The protocol allows any value. No
> constant number will ever be "best", or even correct.
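To make Tom's back-of-the-envelope above concrete, here is a trivial
sketch of the slot-limited ceiling at a 200 ms RTT, assuming at most one
synchronous request outstanding per slot. The slot counts are just the
values that have come up in this thread.

/*
 * Back-of-the-envelope only: each slot completes ~1/RTT = 5 ops/s at a
 * 200 ms RTT, so the aggregate ceiling is roughly slots / RTT.
 */
#include <stdio.h>

int main(void)
{
        const double rtt = 0.200;                        /* seconds */
        const int slots[] = { 10, 32, 64, 128, 256, 512 };

        for (size_t i = 0; i < sizeof(slots) / sizeof(slots[0]); i++)
                printf("%3d slots -> ~%4.0f ops/s\n",
                       slots[i], slots[i] / rtt);

        /* 32 slots -> ~160 ops/s, close to the ~180 flatline Daire saw.
         * Matching NFSv3's ~1,500 ops/s needs roughly 300 requests in
         * flight, which is why ~512 slots at both ends looks about right.
         */
        return 0;
}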
>
>> It's probably just a lot less hassle to stick with NFSv3 for this kind
>> of high latency multi process concurrency use case.
>
> That, too, would be a shame. It's worth the effort to find a better
> NFSv4.1 Linux solution.
>
> Tom.

--
Chuck Lever