RE: NFS over RDMA benchmark

> -----Original Message-----
> From: J. Bruce Fields [mailto:bfields@xxxxxxxxxxxx]
> Sent: Wednesday, April 24, 2013 18:27
> To: Yan Burman
> Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@xxxxxxxxxxxxxxx;
> linux-nfs@xxxxxxxxxxxxxxx; Or Gerlitz
> Subject: Re: NFS over RDMA benchmark
> 
> On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote:
> > On Wed, Apr 24, 2013 at 12:35:03PM +0000, Yan Burman wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: J. Bruce Fields [mailto:bfields@xxxxxxxxxxxx]
> > > > Sent: Wednesday, April 24, 2013 00:06
> > > > To: Yan Burman
> > > > Cc: Wendy Cheng; Atchley, Scott; Tom Tucker;
> > > > linux-rdma@xxxxxxxxxxxxxxx; linux-nfs@xxxxxxxxxxxxxxx; Or Gerlitz
> > > > Subject: Re: NFS over RDMA benchmark
> > > >
> > > > On Thu, Apr 18, 2013 at 12:47:09PM +0000, Yan Burman wrote:
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Wendy Cheng [mailto:s.wendy.cheng@xxxxxxxxx]
> > > > > > Sent: Wednesday, April 17, 2013 21:06
> > > > > > To: Atchley, Scott
> > > > > > Cc: Yan Burman; J. Bruce Fields; Tom Tucker;
> > > > > > linux-rdma@xxxxxxxxxxxxxxx; linux-nfs@xxxxxxxxxxxxxxx
> > > > > > Subject: Re: NFS over RDMA benchmark
> > > > > >
> > > > > > On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott
> > > > > > <atchleyes@xxxxxxxx>
> > > > > > wrote:
> > > > > > > On Apr 17, 2013, at 1:15 PM, Wendy Cheng
> > > > > > > <s.wendy.cheng@xxxxxxxxx>
> > > > > > wrote:
> > > > > > >
> > > > > > >> On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman
> > > > > > >> <yanb@xxxxxxxxxxxx>
> > > > > > wrote:
> > > > > > >>> Hi.
> > > > > > >>>
> > > > > > >>> I've been trying to do some benchmarks for NFS over RDMA and I
> > > > > > >>> seem to only get about half of the bandwidth that the HW can
> > > > > > >>> give me.
> > > > > > >>> My setup consists of 2 servers, each with 16 cores, 32GB of
> > > > > > >>> memory, and a Mellanox ConnectX3 QDR card over PCIe gen3.
> > > > > > >>> These servers are connected to a QDR IB switch. The backing
> > > > > > >>> storage on the server is tmpfs mounted with noatime.
> > > > > > >>> I am running kernel 3.5.7.
> > > > > > >>>
> > > > > > >>> When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes
> > > > > > >>> 4-512K.
> > > > > > >>> When I run fio over rdma mounted nfs, I get 260-2200MB/sec for
> > > > > > >>> the same block sizes (4-512K). Running over IPoIB-CM, I get
> > > > > > >>> 200-980MB/sec.
> > > > > > >
> > > > > > > Yan,
> > > > > > >
> > > > > > > Are you trying to optimize single client performance or server
> > > > > > > performance with multiple clients?
> > > > > > >
> > > > >
> > > > > I am trying to get maximum performance from a single server - I used
> > > > > 2 processes in the fio test - more than 2 did not show any
> > > > > performance boost.
> > > > > I tried running fio from 2 different PCs on 2 different files, but
> > > > > the sum of the two is more or less the same as running from a single
> > > > > client PC.
> > > > >
> > > > > What I did see is that the server is sweating a lot more than the
> > > > > clients and, more than that, it has 1 core (CPU5) at 100% in a
> > > > > softirq tasklet:
> > > > > cat /proc/softirqs
> > > >
> > > > Would any profiling help figure out which code it's spending time in?
> > > > (E.g. something as simple as "perf top" might have useful output.)
> > > >
> > >
> > >
> > > Perf top for the CPU with high tasklet count gives:
> > >
> > >   samples  pcnt   RIP               function             DSO
> > >   _______  _____  ________________  ___________________  _____________
> > >
> > >   2787.00  24.1%  ffffffff81062a00  mutex_spin_on_owner  /root/vmlinux
> >
> > I guess that means lots of contention on some mutex?  If only we knew
> > which one.... perf should also be able to collect stack statistics, I
> > forget how.
> 
> Googling around....  I think we want:
> 
> 	perf record -a --call-graph
> 	(give it a chance to collect some samples, then ^C)
> 	perf report --call-graph --stdio
> 

Sorry it took me a while to get perf to show the call trace (I had not enabled frame pointers in the kernel and struggled with the perf options; a rough sketch of the commands I ended up with is included after the trace), but here is what I get:
    36.18%          nfsd  [kernel.kallsyms]   [k] mutex_spin_on_owner
                    |
                    --- mutex_spin_on_owner
                       |
                       |--99.99%-- __mutex_lock_slowpath
                       |          mutex_lock
                       |          |
                       |          |--85.30%-- generic_file_aio_write
                       |          |          do_sync_readv_writev
                       |          |          do_readv_writev
                       |          |          vfs_writev
                       |          |          nfsd_vfs_write
                       |          |          nfsd_write
                       |          |          nfsd3_proc_write
                       |          |          nfsd_dispatch
                       |          |          svc_process_common
                       |          |          svc_process
                       |          |          nfsd
                       |          |          kthread
                       |          |          kernel_thread_helper
                       |          |
                       |           --14.70%-- svc_send
                       |                     svc_process
                       |                     nfsd
                       |                     kthread
                       |                     kernel_thread_helper
                        --0.01%-- [...]

     9.63%          nfsd  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
                    |
                    --- _raw_spin_lock_irqsave
                       |
                       |--43.97%-- alloc_iova
                       |          intel_alloc_iova
                       |          __intel_map_single
                       |          intel_map_page
                       |          |
                       |          |--60.47%-- svc_rdma_sendto
                       |          |          svc_send
                       |          |          svc_process
                       |          |          nfsd
                       |          |          kthread
                       |          |          kernel_thread_helper
                       |          |
                       |          |--30.10%-- rdma_read_xdr
                       |          |          svc_rdma_recvfrom
                       |          |          svc_recv
                       |          |          nfsd
                       |          |          kthread
                       |          |          kernel_thread_helper
                       |          |
                       |          |--6.69%-- svc_rdma_post_recv
                       |          |          send_reply
                       |          |          svc_rdma_sendto
                       |          |          svc_send
                       |          |          svc_process
                       |          |          nfsd
                       |          |          kthread
                       |          |          kernel_thread_helper
                       |          |
                       |           --2.74%-- send_reply
                       |                     svc_rdma_sendto
                       |                     svc_send
                       |                     svc_process
                       |                     nfsd
                       |                     kthread
                       |                     kernel_thread_helper
                       |
                       |--37.52%-- __free_iova
                       |          flush_unmaps
                       |          add_unmap
                       |          intel_unmap_page
                       |          |
                       |          |--97.18%-- svc_rdma_put_frmr
                       |          |          sq_cq_reap
                       |          |          dto_tasklet_func
                       |          |          tasklet_action
                       |          |          __do_softirq
                       |          |          call_softirq
                       |          |          do_softirq
                       |          |          |
                       |          |          |--97.40%-- irq_exit
                       |          |          |          |
                       |          |          |          |--99.85%-- do_IRQ
                       |          |          |          |          ret_from_intr
                       |          |          |          |          |
                       |          |          |          |          |--40.74%-- generic_file_buffered_write
                       |          |          |          |          |          __generic_file_aio_write
                       |          |          |          |          |          generic_file_aio_write
                       |          |          |          |          |          do_sync_readv_writev
                       |          |          |          |          |          do_readv_writev
                       |          |          |          |          |          vfs_writev
                       |          |          |          |          |          nfsd_vfs_write
                       |          |          |          |          |          nfsd_write
                       |          |          |          |          |          nfsd3_proc_write
                       |          |          |          |          |          nfsd_dispatch
                       |          |          |          |          |          svc_process_common
                       |          |          |          |          |          svc_process
                       |          |          |          |          |          nfsd
                       |          |          |          |          |          kthread
                       |          |          |          |          |          kernel_thread_helper
                       |          |          |          |          |
                       |          |          |          |          |--25.21%-- __mutex_lock_slowpath
                       |          |          |          |          |          mutex_lock
                       |          |          |          |          |          |
                       |          |          |          |          |          |--94.84%-- generic_file_aio_write
                       |          |          |          |          |          |          do_sync_readv_writev
                       |          |          |          |          |          |          do_readv_writev
                       |          |          |          |          |          |          vfs_writev
                       |          |          |          |          |          |          nfsd_vfs_write
                       |          |          |          |          |          |          nfsd_write
                       |          |          |          |          |          |          nfsd3_proc_write
                       |          |          |          |          |          |          nfsd_dispatch
                       |          |          |          |          |          |          svc_process_common
                       |          |          |          |          |          |          svc_process
                       |          |          |          |          |          |          nfsd
                       |          |          |          |          |          |          kthread
                       |          |          |          |          |          |          kernel_thread_helper
                       |          |          |          |          |          |

The entire trace is almost 1MB, so send me an off-list message if you want it.
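For reference, the sequence I ended up with was roughly the following (a sketch from memory; the exact perf flags and kernel config option names on your build may differ):

	# Kernel rebuilt with frame pointers so perf can unwind the call chains:
	#   CONFIG_FRAME_POINTER=y

	# System-wide sampling with call graphs; left running during the fio
	# test, then stopped with ^C:
	perf record -a --call-graph

	# Report with call chains on stdout:
	perf report --call-graph --stdio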
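In case it helps to reproduce, the fio runs were of roughly this shape (illustrative only - the 2 jobs and the 4K-512K block size sweep are as described earlier in the thread, but the mount point, file size, I/O engine and direct flag here are assumptions):

	# Sequential writes to the RDMA-mounted NFS export (tmpfs-backed),
	# repeated with --bs varied from 4k up to 512k:
	fio --name=nfsrdma-write --directory=/mnt/nfsrdma --rw=write \
	    --bs=512k --size=4g --numjobs=2 --ioengine=libaio \
	    --direct=1 --group_reporting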

Yan
