On 4/29/13 8:05 AM, Tom Tucker wrote:
On 4/29/13 7:16 AM, Yan Burman wrote:
-----Original Message-----
From: Wendy Cheng [mailto:s.wendy.cheng@xxxxxxxxx]
Sent: Monday, April 29, 2013 08:35
To: J. Bruce Fields
Cc: Yan Burman; Atchley, Scott; Tom Tucker; linux-rdma@xxxxxxxxxxxxxxx;
linux-nfs@xxxxxxxxxxxxxxx; Or Gerlitz
Subject: Re: NFS over RDMA benchmark
On Sun, Apr 28, 2013 at 7:42 AM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman wrote:
When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
When I run fio over RDMA-mounted NFS, I get 260-2200 MB/sec for the
same block sizes (4-512K). Running over IPoIB-CM, I get
200-980 MB/sec.
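(For reference, a minimal sketch of how these two measurements are
typically taken; the exact options and paths below are assumptions,
not necessarily what was used above.)

   # raw verbs bandwidth, sweeping message sizes
   server# ib_send_bw -a
   client# ib_send_bw -a <server-ip>

   # server: enable the NFS/RDMA listener (20049 is the conventional port)
   server# echo rdma 20049 > /proc/fs/nfsd/portlist

   # client: mount over RDMA and run fio against it
   client# mount -t nfs -o rdma,port=20049 <server-ip>:/export /mnt/nfs
   client# fio --name=randread --filename=/mnt/nfs/testfile --rw=randread \
           --bs=4k --iodepth=128 --ioengine=libaio --direct=1 --size=1g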
...
[snip]
36.18% nfsd [kernel.kallsyms] [k] mutex_spin_on_owner
That's the inode i_mutex.
14.70%-- svc_send
That's the xpt_mutex (ensuring rpc replies aren't interleaved).
9.63% nfsd [kernel.kallsyms] [k] _raw_spin_lock_irqsave
And that (and __free_iova below) looks like iova_rbtree_lock.
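(For anyone reproducing this: a profile like the one above can be
captured on the NFS server with something along these lines; the
sample duration is just an assumption.)

   # run on the server while the fio workload is active on the client
   server# perf record -a -g -- sleep 30
   server# perf report --sort comm,dso,symbol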
Let's revisit your command:
"FIO arguments: --rw=randread --bs=4k --numjobs=2 --iodepth=128 --
ioengine=libaio --size=100000k --prioclass=1 --prio=0 --cpumask=255
--loops=25 --direct=1 --invalidate=1 --fsync_on_close=1
--randrepeat=1 --
norandommap --group_reporting --exitall --buffered=0"
I tried block sizes from 4K to 512K.
4K does not give 2.2GB/sec bandwidth - optimal bandwidth is achieved
around 128-256K block sizes.
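(A sketch of the kind of block-size sweep described above; the file
name, size and other details are assumptions.)

   for bs in 4k 16k 64k 128k 256k 512k; do
       fio --name=bs-$bs --filename=/mnt/nfs/testfile --rw=randread \
           --bs=$bs --numjobs=2 --iodepth=128 --ioengine=libaio \
           --direct=1 --size=100000k --group_reporting
   done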
* inode's i_mutex:
If increasing the process/file count didn't help, maybe increasing
"iodepth" (say, to 512?) could offset the i_mutex overhead a little?
I tried different iodepth values, but found no improvement above an
iodepth of 128.
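(If spreading the load across separate files is acceptable for the
test, fio can do that by itself: with --directory and no --filename,
each job gets its own file, so the jobs no longer contend on a single
inode's i_mutex. A sketch, with assumed paths and job counts:)

   fio --name=randread --directory=/mnt/nfs --numjobs=8 --iodepth=256 \
       --rw=randread --bs=4k --ioengine=libaio --direct=1 \
       --size=100000k --group_reporting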
* xpt_mutex:
(no idea)
* iova_rbtree_lock:
DMA mapping fragmentation? I have not studied whether NFS-RDMA
routines such as "svc_rdma_sendto()" could do better, but maybe
sequential IO (instead of "randread") could help? A bigger block size
(instead of 4K) might also help?
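(A sketch of the variation being suggested - sequential reads with a
larger block size; whether it actually reduces the iova_rbtree_lock
contention would have to be measured.)

   fio --name=seqread --filename=/mnt/nfs/testfile --rw=read --bs=256k \
       --numjobs=2 --iodepth=128 --ioengine=libaio --direct=1 \
       --size=100000k --group_reporting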
I think the biggest issue is that max_payload for TCP is 2MB but only
256K for RDMA.
Sorry, I mean 1MB for TCP...
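(One way to see what transfer size the client actually negotiated is
to look at the rsize/wsize of the mount; the mount point below is an
assumption.)

   client# nfsstat -m
   client# grep /mnt/nfs /proc/mounts    # look for rsize= and wsize=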
I am trying to simulate real load (more or less); that is the reason
I use randread. Anyhow, sequential read does not result in better
performance. That is probably because the backing storage is tmpfs...
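(For completeness, a minimal sketch of a tmpfs-backed export of the
kind described, so the disk stays out of the measurement; paths, size
and export options are assumptions - "insecure" is commonly needed for
NFS/RDMA clients, which do not connect from a privileged port.)

   server# mount -t tmpfs -o size=16g tmpfs /export
   server# echo '/export *(rw,async,insecure,no_root_squash)' >> /etc/exports
   server# exportfs -ra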
Yan