Re: NFS over RDMA benchmark

Wendy Cheng <s.wendy.cheng@xxxxxxxxx> · Sun, 28 Apr 2013 22:34:50 -0700

On Sun, Apr 28, 2013 at 7:42 AM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:

>> On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman

>> When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
>> When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the
>> same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
> ...

[snip]

>>     36.18%          nfsd  [kernel.kallsyms]   [k] mutex_spin_on_owner
>
> That's the inode i_mutex.
>
>>     14.70%-- svc_send
>
> That's the xpt_mutex (ensuring rpc replies aren't interleaved).
>
>>
>>      9.63%          nfsd  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
>>
>
> And that (and __free_iova below) looks like iova_rbtree_lock.
>
>

Let's revisit your command:

"FIO arguments: --rw=randread --bs=4k --numjobs=2 --iodepth=128
--ioengine=libaio --size=100000k --prioclass=1 --prio=0 --cpumask=255
--loops=25 --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1
--norandommap --group_reporting --exitall --buffered=0"

* inode's i_mutex:
If increasing the process/file count didn't help, maybe increasing
"iodepth" (say, to 512?) could offset the i_mutex overhead a little?

* xpt_mutex:
(no idea)

* iova_rbtree_lock:
DMA mapping fragmentation? I have not studied whether NFS-RDMA
routines such as "svc_rdma_sendto()" could do better, but maybe
sequential IO (instead of "randread") would help? Or a bigger block
size (instead of 4K)?
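
To make the suggestions above concrete, here is a rough sketch that
builds one fio command line per variation, changing a single option
at a time so the runs stay comparable. It only prints the commands and
does not run fio. The job name and test file path are hypothetical
(they are not from the thread), and 64k is just an arbitrary pick from
the 4-512K range already tested.

```python
import shlex

# Original arguments from the thread, kept as-is.
BASE_ARGS = (
    "--rw=randread --bs=4k --numjobs=2 --iodepth=128 "
    "--ioengine=libaio --size=100000k --prioclass=1 --prio=0 --cpumask=255 "
    "--loops=25 --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1 "
    "--norandommap --group_reporting --exitall --buffered=0"
).split()

# Hypothetical names, not from the thread: adjust to the real mount.
JOB_NAME = "nfsrdma-test"
TEST_FILE = "/mnt/nfs_rdma/fio.dat"

# One changed option per run.
VARIANTS = {
    "baseline": {},
    "deeper-queue": {"iodepth": 512},   # i_mutex suggestion
    "sequential": {"rw": "read"},       # iova_rbtree_lock suggestion
    "bigger-blocks": {"bs": "64k"},     # iova_rbtree_lock suggestion
}


def with_overrides(args, **overrides):
    """Return a copy of args with matching --key=value options replaced."""
    out = []
    for arg in args:
        key = arg[2:].split("=", 1)[0]
        if key in overrides:
            out.append(f"--{key}={overrides[key]}")
        else:
            out.append(arg)
    return out


def fio_command(variant):
    """Build the full fio argument list for one named variant."""
    args = with_overrides(BASE_ARGS, **VARIANTS[variant])
    return ["fio", f"--name={JOB_NAME}-{variant}",
            f"--filename={TEST_FILE}"] + args


if __name__ == "__main__":
    for name in VARIANTS:
        print(shlex.join(fio_command(name)))
```

Comparing the perf profile of each run against the baseline should
show whether a given change moves the corresponding lock at all.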

-- Wendy