On Wed, Jan 24, 2018 at 4:00 PM, Anuj Kalia <anujkaliaiitd@xxxxxxxxx> wrote:
> IMO this is probably an implementation issue in the benchmarking code,
> and I'm curious to know the issue if you find it.
>
> It's possible to achieve 150+ million writes per second with a
> multi-threaded process. See Figure 12 in our paper:
> http://www.cs.cmu.edu/~akalia/doc/atc16/rdma_bench_atc.pdf. Our
> benchmark code is available:
> https://github.com/efficient/rdma_bench/tree/master/rw-tput-sender.

I read through your paper and code (great work!) but I don't think it
is an implementation issue. I am comparing my numbers against Figure
12b of your paper, since the CX3 cluster is the closest to my testbed,
which uses a single-port ConnectX-4 card. Huge pages are the only
optimization we use; we don't use doorbell batching, unsignaled
completions, inlining, etc. Even so, the numbers are comparable: ~27M
writes/second from our benchmark without your optimizations vs. ~35M
writes/second from your benchmark with all the optimizations. The 150M
writes/s on the CIB cluster is on a dual-port card. More importantly,
the ~35M writes/s on the CX3 cluster is >1.5x lower than the ~55M
writes/s that we see with the multi-process benchmark without
optimizations.

> --Anuj
>
> On Wed, Jan 24, 2018 at 3:53 PM, Rohit Zambre <rzambre@xxxxxxx> wrote:
>> On Wed, Jan 24, 2018 at 11:08 AM, Jason Gunthorpe <jgg@xxxxxxxx> wrote:
>>> On Wed, Jan 24, 2018 at 10:22:53AM -0600, Rohit Zambre wrote:
>>>
>>>> (1) First, is this a surprising result or is the 2x difference
>>>> actually expected behavior?
>>>
>>> Maybe, there are lots of locks in one process, for instance glibc's
>>> malloc has locking - so any memory allocation anywhere in the
>>> application's processing path will cause lock contention. The issue
>>> may have nothing to do with RDMA.
>>
>> There are no mallocs in the critical path of the benchmark. In the
>> 1-process multi-threaded case, the mallocs for resource creation all
>> happen before creating the OpenMP parallel region. Here's a snapshot
>> of the parallel region that contains the critical path:
>>
>> #pragma omp parallel
>> {
>>     int i = omp_get_thread_num(), k;
>>     int cqe_count = 0;
>>     int post_count = 0;
>>     int comp_count = 0;
>>     int posts = 0;
>>
>>     struct ibv_send_wr *bad_send_wqe;
>>     struct ibv_wc *WC = (struct ibv_wc*) malloc(qp_depth *
>>         sizeof(struct ibv_wc)); // qp_depth is 128 (adopted from perftest)
>>
>>     #pragma omp single
>>     { // only one thread will execute this
>>         MPI_Barrier(MPI_COMM_WORLD);
>>     } // implicit barrier for the threads
>>     if (i == 0)
>>         t_start = MPI_Wtime();
>>
>>     /* Critical Path Start */
>>     while (post_count < posts_per_qp || comp_count < posts_per_qp) {
>>         // posts_per_qp = num_of_msgs / num_qps
>>         /* Post */
>>         posts = min((posts_per_qp - post_count),
>>                     (qp_depth - (post_count - comp_count)));
>>         for (k = 0; k < posts; k++)
>>             ret = ibv_post_send(qp[i], &send_wqe[i], &bad_send_wqe);
>>         post_count += posts;
>>         /* Poll */
>>         if (comp_count < posts_per_qp) {
>>             cqe_count = ibv_poll_cq(cq[i], num_comps, WC); // num_comps = qp_depth
>>             comp_count += cqe_count;
>>         }
>>     } /* Critical Path End */
>>     if (i == 0)
>>         t_end = MPI_Wtime();
>> }
>>
>>> There is also some locking inside the userspace mlx5 driver that may
>>> contend depending on how your process has set things up.
>>
>> I missed mentioning this, but I collected the numbers with
>> MLX5_SINGLE_THREADED set since none of the resources were being shared
>> between the threads. So, the userspace driver wasn't taking any locks.
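
(For context: below is roughly how the per-thread resources are created
before the parallel region shown above. This is a minimal sketch rather
than the actual benchmark code; MAX_THREADS, create_per_thread_resources,
and the omitted error checking are illustrative, while qp[] and cq[]
correspond to the arrays used in the snippet.)

/* Sketch only: one context/PD/CQ/QP per thread, nothing shared. */
#include <stdlib.h>
#include <infiniband/verbs.h>

#define MAX_THREADS 16                       /* illustrative */

static struct ibv_context *ctx[MAX_THREADS];
static struct ibv_pd      *pd[MAX_THREADS];
static struct ibv_cq      *cq[MAX_THREADS];
static struct ibv_qp      *qp[MAX_THREADS];

static void create_per_thread_resources(int num_threads, int qp_depth)
{
    /* Must be in the environment before the contexts are opened. */
    setenv("MLX5_SINGLE_THREADED", "1", 1);

    struct ibv_device **dev_list = ibv_get_device_list(NULL);

    for (int i = 0; i < num_threads; i++) {
        /* Each thread gets its own context, PD, CQ and QP, so the
         * userspace driver has no lock to contend on. */
        ctx[i] = ibv_open_device(dev_list[0]);
        pd[i]  = ibv_alloc_pd(ctx[i]);
        cq[i]  = ibv_create_cq(ctx[i], qp_depth, NULL, NULL, 0);

        struct ibv_qp_init_attr attr = {
            .send_cq = cq[i],
            .recv_cq = cq[i],
            .cap     = { .max_send_wr  = qp_depth, .max_recv_wr  = 1,
                         .max_send_sge = 1,        .max_recv_sge = 1 },
            .qp_type = IBV_QPT_RC,
        };
        qp[i] = ibv_create_qp(pd[i], &attr);
    }

    ibv_free_device_list(dev_list);
}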
>>
>>> The entire send path is in user space so there is no kernel component
>>> here.
>>
>> Yes, that's correct. My concern was that during resource creation, the
>> kernel was maybe sharing some resource for a process, or that hardware
>> contexts were somehow being multiplexed through control groups. Is it
>> safe for me to conclude that separate, independent contexts/bfregs are
>> being assigned when a process calls ibv_open_device multiple times?
>>
>>> Jason
>>
>> Thanks,
>> Rohit Zambre
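
P.S. For reference, the sender-side optimizations mentioned at the top of
this mail (doorbell batching, unsignaled completions, inlining) would show
up in the post loop roughly as follows. This is a hedged sketch, not code
from either benchmark; BATCH, UNSIG_INTERVAL, post_write_batch, and the
inline-size check are made-up placeholders.

/* Sketch: chain BATCH WRs and post them with one ibv_post_send call
 * (doorbell batching), request a CQE only occasionally (unsignaled
 * completions), and copy small payloads into the WQE (inlining). */
#include <string.h>
#include <infiniband/verbs.h>

#define BATCH          16    /* illustrative */
#define UNSIG_INTERVAL 64    /* request a CQE only every 64th WR */

static int post_write_batch(struct ibv_qp *qp, struct ibv_sge *sge,
                            uint64_t raddr, uint32_t rkey, long *wr_count)
{
    struct ibv_send_wr wr[BATCH], *bad_wr;

    memset(wr, 0, sizeof(wr));
    for (int j = 0; j < BATCH; j++) {
        wr[j].opcode              = IBV_WR_RDMA_WRITE;
        wr[j].sg_list             = &sge[j];
        wr[j].num_sge             = 1;
        wr[j].wr.rdma.remote_addr = raddr;
        wr[j].wr.rdma.rkey        = rkey;
        /* Unsignaled completions: QP must be created with sq_sig_all = 0. */
        if (++(*wr_count) % UNSIG_INTERVAL == 0)
            wr[j].send_flags |= IBV_SEND_SIGNALED;
        /* Inlining: only valid if length <= the QP's max_inline_data. */
        if (sge[j].length <= 60)
            wr[j].send_flags |= IBV_SEND_INLINE;
        /* Doorbell batching: link the WRs; the doorbell rings once. */
        wr[j].next = (j == BATCH - 1) ? NULL : &wr[j + 1];
    }
    return ibv_post_send(qp, &wr[0], &bad_wr);
}

The point of the chain is that the doorbell (and any driver-side locking,
when it is taken) is paid once per BATCH WRs instead of once per WR.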
<<attachment: multi-threadVSproc.zip>>