On Sun, Apr 28, 2013 at 06:28:16AM +0000, Yan Burman wrote:
> > > > > > > >> On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman <yanb@xxxxxxxxxxxx>
> > > > > > > >>> I've been trying to do some benchmarks for NFS over RDMA and I seem to
> > > > > > > >>> only get about half of the bandwidth that the HW can give me.
> > > > > > > >>> My setup consists of 2 servers each with 16 cores, 32Gb of memory, and
> > > > > > > >>> Mellanox ConnectX3 QDR card over PCI-e gen3.
> > > > > > > >>> These servers are connected to a QDR IB switch. The backing storage on
> > > > > > > >>> the server is tmpfs mounted with noatime.
> > > > > > > >>> I am running kernel 3.5.7.
> > > > > > > >>>
> > > > > > > >>> When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
> > > > > > > >>> When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the
> > > > > > > >>> same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
...
> > > > > > I am trying to get maximum performance from a single server - I used 2
> > > > > > processes in fio test - more than 2 did not show any performance boost.
> > > > > > I tried running fio from 2 different PCs on 2 different files, but the sum of
> > > > > > the two is more or less the same as running from single client PC.
> > > > > >
> > > > > > What I did see is that server is sweating a lot more than the clients and
> > > > > > more than that, it has 1 core (CPU5) in 100% softirq tasklet:
> > > > > > cat /proc/softirqs
...
> > > > Perf top for the CPU with high tasklet count gives:
> > > >
> > > >   samples  pcnt  RIP               function             DSO
...
> > > >   2787.00  24.1% ffffffff81062a00  mutex_spin_on_owner  /root/vmlinux
...
> > Googling around....  I think we want:
> >
> > 	perf record -a --call-graph
> > 	(give it a chance to collect some samples, then ^C)
> > 	perf report --call-graph --stdio
> >
> Sorry it took me a while to get perf to show the call trace (did not enable frame
> pointers in kernel and struggled with perf options...), but what I get is:
>
> 36.18%  nfsd  [kernel.kallsyms]  [k] mutex_spin_on_owner
>          |
>          --- mutex_spin_on_owner
>             |
>             |--99.99%-- __mutex_lock_slowpath
>             |          mutex_lock
>             |          |
>             |          |--85.30%-- generic_file_aio_write

That's the inode i_mutex.

>             |          |          do_sync_readv_writev
>             |          |          do_readv_writev
>             |          |          vfs_writev
>             |          |          nfsd_vfs_write
>             |          |          nfsd_write
>             |          |          nfsd3_proc_write
>             |          |          nfsd_dispatch
>             |          |          svc_process_common
>             |          |          svc_process
>             |          |          nfsd
>             |          |          kthread
>             |          |          kernel_thread_helper
>             |          |
>             |           --14.70%-- svc_send

That's the xpt_mutex (ensuring rpc replies aren't interleaved).

>             |                     svc_process
>             |                     nfsd
>             |                     kthread
>             |                     kernel_thread_helper
>              --0.01%-- [...]
>
>  9.63%  nfsd  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>          |
>          --- _raw_spin_lock_irqsave
>             |
>             |--43.97%-- alloc_iova

And that (and __free_iova below) looks like iova_rbtree_lock.

--b.

>             |          intel_alloc_iova
>             |          __intel_map_single
>             |          intel_map_page
>             |          |
>             |          |--60.47%-- svc_rdma_sendto
>             |          |          svc_send
>             |          |          svc_process
>             |          |          nfsd
>             |          |          kthread
>             |          |          kernel_thread_helper
>             |          |
>             |          |--30.10%-- rdma_read_xdr
>             |          |          svc_rdma_recvfrom
>             |          |          svc_recv
>             |          |          nfsd
>             |          |          kthread
>             |          |          kernel_thread_helper
>             |          |
>             |          |--6.69%-- svc_rdma_post_recv
>             |          |          send_reply
>             |          |          svc_rdma_sendto
>             |          |          svc_send
>             |          |          svc_process
>             |          |          nfsd
>             |          |          kthread
>             |          |          kernel_thread_helper
>             |          |
>             |           --2.74%-- send_reply
>             |                     svc_rdma_sendto
>             |                     svc_send
>             |                     svc_process
>             |                     nfsd
>             |                     kthread
>             |                     kernel_thread_helper
>             |
>             |--37.52%-- __free_iova
>             |          flush_unmaps
>             |          add_unmap
>             |          intel_unmap_page
>             |          |
>             |          |--97.18%-- svc_rdma_put_frmr
>             |          |          sq_cq_reap
>             |          |          dto_tasklet_func
>             |          |          tasklet_action
>             |          |          __do_softirq
>             |          |          call_softirq
>             |          |          do_softirq
>             |          |          |
>             |          |          |--97.40%-- irq_exit
>             |          |          |          |
>             |          |          |          |--99.85%-- do_IRQ
>             |          |          |          |          ret_from_intr
>             |          |          |          |          |
>             |          |          |          |          |--40.74%-- generic_file_buffered_write
>             |          |          |          |          |          __generic_file_aio_write
>             |          |          |          |          |          generic_file_aio_write
>             |          |          |          |          |          do_sync_readv_writev
>             |          |          |          |          |          do_readv_writev
>             |          |          |          |          |          vfs_writev
>             |          |          |          |          |          nfsd_vfs_write
>             |          |          |          |          |          nfsd_write
>             |          |          |          |          |          nfsd3_proc_write
>             |          |          |          |          |          nfsd_dispatch
>             |          |          |          |          |          svc_process_common
>             |          |          |          |          |          svc_process
>             |          |          |          |          |          nfsd
>             |          |          |          |          |          kthread
>             |          |          |          |          |          kernel_thread_helper
>             |          |          |          |          |
>             |          |          |          |          |--25.21%-- __mutex_lock_slowpath
>             |          |          |          |          |          mutex_lock
>             |          |          |          |          |          |
>             |          |          |          |          |          |--94.84%-- generic_file_aio_write
>             |          |          |          |          |          |          do_sync_readv_writev
>             |          |          |          |          |          |          do_readv_writev
>             |          |          |          |          |          |          vfs_writev
>             |          |          |          |          |          |          nfsd_vfs_write
>             |          |          |          |          |          |          nfsd_write
>             |          |          |          |          |          |          nfsd3_proc_write
>             |          |          |          |          |          |          nfsd_dispatch
>             |          |          |          |          |          |          svc_process_common
>             |          |          |          |          |          |          svc_process
>             |          |          |          |          |          |          nfsd
>             |          |          |          |          |          |          kthread
>             |          |          |          |          |          |          kernel_thread_helper
>             |          |          |          |          |          |
>
> The entire trace is almost 1MB, so send me an off-list message if you want it.
>
> Yan
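
A note for readers following the trace: both contention points called out above are
plain mutex_lock() calls in the nfsd write/reply path. Below is a minimal sketch,
paraphrased from the 3.5-era mm/filemap.c and net/sunrpc/svc_xprt.c (error handling
and the O_DIRECT/sync paths are trimmed, so this is not the verbatim source):

/* Simplified sketch, not the verbatim 3.5 code. */

/* mm/filemap.c: every buffered write takes that inode's i_mutex, so all
 * nfsd threads writing to the same file serialize here. */
ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
			       unsigned long nr_segs, loff_t pos)
{
	struct inode *inode = iocb->ki_filp->f_mapping->host;
	ssize_t ret;

	mutex_lock(&inode->i_mutex);
	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
	mutex_unlock(&inode->i_mutex);
	/* (O_SYNC handling trimmed) */
	return ret;
}

/* net/sunrpc/svc_xprt.c: one reply at a time per transport, so replies on
 * a single client connection also serialize. */
int svc_send(struct svc_rqst *rqstp)
{
	struct svc_xprt *xprt = rqstp->rq_xprt;
	int len;

	mutex_lock(&xprt->xpt_mutex);		/* the xpt_mutex seen above */
	len = xprt->xpt_ops->xpo_sendto(rqstp);
	mutex_unlock(&xprt->xpt_mutex);
	return len;
}

With the fio writers hitting one file over one connection, every write queues behind
that file's i_mutex and every reply behind that transport's xpt_mutex, no matter how
many nfsd threads are configured.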
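
The _raw_spin_lock_irqsave samples under alloc_iova and __free_iova point at the
Intel IOMMU's IOVA allocator, as noted above. A rough sketch of the 3.5-era
drivers/iommu/iova.c (the rbtree search and insert details are elided, so again
not verbatim):

/* Simplified sketch of drivers/iommu/iova.c; rbtree details elided. */

struct iova *alloc_iova(struct iova_domain *iovad, unsigned long size,
			unsigned long limit_pfn, bool size_aligned)
{
	struct iova *new_iova;
	unsigned long flags;

	new_iova = alloc_iova_mem();
	if (!new_iova)
		return NULL;

	/* One lock per IOMMU domain guards the rbtree of allocated ranges,
	 * so every intel_map_page() from every nfsd thread funnels here. */
	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
	/* ... find a free range below limit_pfn and insert it ... */
	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);

	return new_iova;
}

void __free_iova(struct iova_domain *iovad, struct iova *iova)
{
	unsigned long flags;

	/* Same lock again on unmap, here from the dto_tasklet_func softirq. */
	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
	rb_erase(&iova->node, &iovad->rbroot);
	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
	free_iova_mem(iova);
}

Every DMA map done for svc_rdma_sendto/rdma_read_xdr and every unmap done from the
dto_tasklet_func completion tasklet takes the same per-domain spinlock, which would
be consistent with one CPU pinned at 100% softirq while the nfsd threads spin
waiting to map new pages.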