RE: NFS over RDMA benchmark

Yan Burman <yanb@xxxxxxxxxxxx> · Mon, 22 Apr 2013 11:07:36 +0000

> -----Original Message-----
> From: Peng Tao [mailto:bergwolf@xxxxxxxxx]
> Sent: Friday, April 19, 2013 05:28
> To: Yan Burman
> Cc: J. Bruce Fields; Tom Tucker; linux-rdma@xxxxxxxxxxxxxxx;
> linux-nfs@xxxxxxxxxxxxxxx
> Subject: Re: NFS over RDMA benchmark
> 
> On Wed, Apr 17, 2013 at 10:36 PM, Yan Burman <yanb@xxxxxxxxxxxx>
> wrote:
> > Hi.
> >
> > I've been trying to do some benchmarks for NFS over RDMA and I seem to
> > only get about half of the bandwidth that the HW can give me.
> > My setup consists of 2 servers each with 16 cores, 32GB of memory, and
> > Mellanox ConnectX3 QDR card over PCI-e gen3.
> > These servers are connected to a QDR IB switch. The backing storage on the
> > server is tmpfs mounted with noatime.
> > I am running kernel 3.5.7.
> >
> > When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
> > When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the same
> > block sizes (4-512K). Running over IPoIB-CM, I get 200-980MB/sec.
> > I got to these results after the following optimizations:
> > 1. Setting IRQ affinity to the CPUs that are part of the NUMA node the
> >    card is on
> > 2. Increasing
> >    /proc/sys/sunrpc/svc_rdma/max_outbound_read_requests and
> >    /proc/sys/sunrpc/svc_rdma/max_requests to 256 on server
> > 3. Increasing RPCNFSDCOUNT to 32 on server
> Did you try to affine nfsd to corresponding CPUs where your IB card locates?
> Given that you see a bottleneck on CPU (as in your later email), it might be
> worth trying.

I tried to affine nfsd to CPUs on the NUMA node the IB card is on.
I also set tmpfs memory policy to allocate from the same NUMA node.
I did not see a big difference.
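
For anyone who wants to repeat the nfsd pinning, here is a rough sketch of one way to do it. It is not the exact procedure used for the numbers above. It assumes the HCA shows up as mlx4_0 (a guess, check /sys/class/infiniband on your box), that the PCI device exposes local_cpulist in sysfs, and that it is run as root on the server. The helper names are made up for this example, and the parser only handles plain "a-b,c" lists.

```python
import os


def parse_cpulist(text):
    """Parse a kernel cpulist string such as "0-7,16-23" into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def local_cpus(ib_dev="mlx4_0"):
    """CPUs local to the HCA's NUMA node. ASSUMPTION: device name mlx4_0."""
    path = "/sys/class/infiniband/%s/device/local_cpulist" % ib_dev
    with open(path) as f:
        return parse_cpulist(f.read())


def pids_by_comm(name):
    """Return pids of all tasks whose /proc/<pid>/comm equals name."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/comm" % entry) as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # task exited while we were scanning
    return pids


def pin(name, cpus):
    """Try to restrict every task called name to the given CPUs."""
    for pid in pids_by_comm(name):
        try:
            os.sched_setaffinity(pid, cpus)
        except OSError as err:
            # Some kernel threads refuse affinity changes; report and move on.
            print("could not pin %s (pid %d): %s" % (name, pid, err))


if __name__ == "__main__":
    pin("nfsd", local_cpus())
```

The per-pid error handling is there so that a thread which rejects the affinity change is reported instead of aborting the whole run.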

> 
> > 4. FIO arguments: --rw=randread --bs=4k --numjobs=2 --iodepth=128
> > --ioengine=libaio --size=100000k --prioclass=1 --prio=0 --cpumask=255
> > --loops=25 --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1
> > --norandommap --group_reporting --exitall --buffered=0
> >
> On client side, it may be good to affine FIO processes and nfsiod to CPUs
> where IB card locates as well, in case client is the bottleneck.
> 

I am doing that - cpumask=255 affines it to the NUMA node my card is on.
For some reason doing taskset on nfsiod fails.
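
To spell out the cpumask value: fio takes --cpumask as a bit mask of allowed CPUs, so 255 (0xff) means CPUs 0-7. Whether those are the CPUs on the card's NUMA node depends on the machine's topology, so treat the small sketch below as a way to check the arithmetic on your own setup. The function names are just for illustration.

```python
def mask_to_cpus(mask):
    """Expand an affinity bit mask into the list of CPU ids it allows."""
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


def cpus_to_mask(cpus):
    """Build the bit mask that allows exactly the given CPU ids."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


if __name__ == "__main__":
    print(mask_to_cpus(255))       # CPUs allowed by --cpumask=255
    print(cpus_to_mask(range(8)))  # mask covering CPUs 0-7
```

If the card's local CPUs were, say, 8-15 instead, cpus_to_mask(range(8, 16)) would give the value to pass to --cpumask.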

> --
> Thanks,
> Tao