Hi Chuck,
thanks for looking into this. (Answers inline...)
Chuck Lever wrote on 27.01.2020 15:12:
> On Jan 27, 2020, at 9:06 AM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>> Hi Sven-
>>
>> On Jan 26, 2020, at 6:41 PM, Sven Breuner <sven@xxxxxxxxxxxx> wrote:
>>> Hi,
>>>
>>> I'm using the kernel NFS client/server and am trying to read as many small
>>> files per second as possible from a single NFS client, but I seem to run
>>> into a bottleneck.
>>> Maybe I am just missing a tunable: the CPUs on client and server are mostly
>>> idle, the 100Gbit (RoCE) network links between client and server are mostly
>>> idle, and the NVMe drives in the server are mostly idle as well. (The
>>> server also has enough RAM to easily fit the whole test data set in the
>>> ext4/xfs page cache, but a second read of the data set from that RAM cache
>>> doesn't change the result much.)
>>> This is my test case:
>>>
>>> # Create 1.6M 10KB files through 128 mdtest processes in different directories...
>>> $ mpirun -hosts localhost -np 128 /path/to/mdtest -F -d /mnt/nfs/mdtest -i 1 -I 100 -z 1 -b 128 -L -u -w 10240 -e 10240 -C
>>>
>>> # Read all the files through 128 mdtest processes (the case that matters primarily for my test)...
>>> $ mpirun -hosts localhost -np 128 /path/to/mdtest -F -d /mnt/nfs/mdtest -i 1 -I 100 -z 1 -b 128 -L -u -w 10240 -e 10240 -E
>>>
>>> The result is about 20,000 file reads per sec, so only ~200MB/s network
>>> throughput.
>> What is the typical size of the NFS READ I/Os on the wire?
The application fetches each full 10KB file in a single read op (i.e.
"read(fd, buf, 10240)"), and NFS wsize/rsize is 512KB.
>> Are you sure your mpirun workload is generating enough parallelism?
Yes, MPI is only used to start the 128 processes and to aggregate the
performance results at the end. For the actual file read phase, all 128
processes run completely independently, without any communication or
synchronization. Each process is working in its own subdir with its own set
of 10KB files.
(Running the same test directly on the local xfs of the NFS server box results
in ~350,000 10KB file reads per sec after a cache drop, and >1 million 10KB
file reads per sec from the page cache. Just mentioning this for the sake of
completeness, to show that this is not hitting a limit on the server side.)
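
To rule out client-side serialization of the RPCs themselves, I also want to
watch how many RPCs are actually in flight at once during the read phase. On
newer kernels with debugfs mounted, something like this sketch should work
(the sunrpc debugfs files may not exist on the CentOS 3.10-based kernels, so
this is an assumption for the 4.18 setup):

# Count outstanding RPC tasks once per second during the read phase:
$ while sleep 1; do cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks 2>/dev/null | wc -l; done

If this never climbs far above 4, the requests are already being serialized
before they reach the transport.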
> A couple of other thoughts:
>
> What's the client hardware like? NUMA? Fast memory? CPU count?
Client and server are dual-socket Intel Xeon E5-2690 v4 @ 2.60GHz (14 cores
per socket plus hyper-threading), with all 4 memory channels per socket
populated with the fastest possible DIMMs (DDR4 2400).
I also tried the sunrpc pool_mode settings auto/global/pernode on the server
side.
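
For reference, the pool_mode switching was done along these lines (the mode
can only be changed while nfsd is stopped, and my reading of the pool_stats
columns as packets arrived / sockets enqueued / threads woken per pool is from
memory):

# On the server: switch the sunrpc thread pool mode, then watch the pools:
# systemctl stop nfs-server
# echo pernode > /sys/module/sunrpc/parameters/pool_mode
# systemctl start nfs-server
# watch -n1 cat /proc/fs/nfsd/pool_stats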
> Have you configured device interrupt affinity and used tuned
> to disable CPU sleep states, etc?
Yes, CPU power saving (frequency scaling) is disabled. I tried the tuned
profiles latency-performance and throughput-performance, and also tried
irqbalance and mlnx_affinity.
All without any significant effect, unfortunately.
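
In case the details matter, the interrupt spreading was verified roughly like
this (mlx5 is the ConnectX-5 driver name here; adjust the pattern for other
NICs):

# Show which CPUs each mlx5 queue IRQ is pinned to:
$ for irq in $(awk '/mlx5/ { sub(":", "", $1); print $1 }' /proc/interrupts); do
    echo "IRQ $irq -> CPUs $(cat /proc/irq/$irq/smp_affinity_list)"
  done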
> Have you properly configured your 100GbE switch and cards?
> I have a Mellanox SN2100 here and two hosts with CX-5 Ethernet.
> The configuration of the cards and switch is critical to good
> performance.
Yes, I can absolutely confirm that getting this part of the config right is
critical for great performance :-) Both hosts and the switch are configured
with PFC and ECN, and I double-checked that packets are tagged correctly and
stay lossless in the RoCE case.
The topology is simple: client and server are connected to the same Mellanox
switch, and nothing else is happening on the switch.
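
The checks themselves were along these lines (mlnx_qos ships with the Mellanox
OFED tools, and the exact ethtool counter names vary by driver version, so
take the grep pattern as an approximation):

# Show PFC/trust/buffer settings for the port:
# mlnx_qos -i eth1
# Watch per-priority pause and congestion counters for drops/marks:
$ ethtool -S eth1 | grep -Ei 'pause|ecn|discard'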
>>> I noticed in "top" that only 4 nfsd processes are active, so I'm wondering
>>> why the load is not spread across more of my 64 /proc/fs/nfsd/threads. Even
>>> the few nfsd processes that are active use less than 50% of their core
>>> each, and the CPUs are shown as >90% idle in "top" on client and server
>>> during the read phase.
>>>
>>> I've tried:
>>> * CentOS 7.5 and 7.6 kernels (3.10.0-...) on client and server; and Ubuntu
>>>   18 with a 4.18 kernel on the server side
>>> * TCP & RDMA
>>> * Mounted as NFSv3/v4.1/v4.2
>>> * Increased tcp_slot_table_entries to 1024
>>> ...but all of that didn't change the fact that only 4 nfsd processes are
>>> active on the server, and thus I'm getting the same result already when
>>> /proc/fs/nfsd/threads is set to only 4 instead of 64.
>>>
>>> Any pointer to how I can overcome this limit will be greatly appreciated.
>>>
>>> Thanks in advance
>>> Sven
>> --
>> Chuck Lever
>
> --
> Chuck Lever