Re: 2x difference between multi-thread and multi-process for same number of CTXs

On Wed, Jan 24, 2018 at 4:00 PM, Anuj Kalia <anujkaliaiitd@xxxxxxxxx> wrote:
> IMO this is probably an implementation issue in the benchmarking code,
> and I'm curious to know the issue if you find it.
>
> It's possible to achieve 150+ million writes per second with a
> multi-threaded process. See Figure 12 in our paper:
> http://www.cs.cmu.edu/~akalia/doc/atc16/rdma_bench_atc.pdf. Our
> benchmark code is available:
> https://github.com/efficient/rdma_bench/tree/master/rw-tput-sender.

I read through your paper and code (great work!) but I don't think it
is an implementation issue. I am comparing my numbers against Figure
12b of your paper, since the CX3 cluster is the closest to my testbed,
which uses a single-port ConnectX-4 card. Hugepages are the only
optimization we use; we don't use doorbell batching, unsignaled
completions, inlining, etc. However, the numbers are comparable: ~27M
writes/second from our benchmark without your optimizations vs. ~35M
writes/second from your benchmark with all the optimizations. The 150M
writes/s on the CIB cluster is on a dual-port card. More importantly,
the ~35M writes/s on the CX3 cluster is more than 1.5x lower than the
~55M writes/s that we see with the multi-process benchmark without
optimizations.
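
(For clarity about which knobs I mean, here is a rough sketch of how
inlining and unsignaled completions are typically toggled per WQE with
libibverbs. It is not our benchmark code, and the function and
parameter names are made up for illustration; we post every WQE
signaled and without IBV_SEND_INLINE.)

    #include <string.h>
    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Sketch only, not our benchmark code. Assumes the QP was created
     * with qp_init_attr.sq_sig_all = 0, so only WQEs carrying
     * IBV_SEND_SIGNALED generate CQEs. */
    static int post_write(struct ibv_qp *qp, struct ibv_sge *sge,
                          uint64_t remote_addr, uint32_t rkey,
                          int post_idx, int signal_interval, int fits_inline)
    {
        struct ibv_send_wr wr, *bad_wr;

        memset(&wr, 0, sizeof(wr));
        wr.opcode = IBV_WR_RDMA_WRITE;
        wr.sg_list = sge;
        wr.num_sge = 1;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey = rkey;

        if (fits_inline)                        // payload <= the QP's max_inline_data
            wr.send_flags |= IBV_SEND_INLINE;   // inlining
        if (post_idx % signal_interval == 0)    // e.g. one CQE every 64 posts
            wr.send_flags |= IBV_SEND_SIGNALED; // unsignaled otherwise

        return ibv_post_send(qp, &wr, &bad_wr);
    }

As I understand it, doorbell batching would additionally chain several
WRs through wr.next and issue a single ibv_post_send per batch instead
of one call per WQE; we don't do that either.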

> --Anuj
>
> On Wed, Jan 24, 2018 at 3:53 PM, Rohit Zambre <rzambre@xxxxxxx> wrote:
>> On Wed, Jan 24, 2018 at 11:08 AM, Jason Gunthorpe <jgg@xxxxxxxx> wrote:
>>> On Wed, Jan 24, 2018 at 10:22:53AM -0600, Rohit Zambre wrote:
>>>
>>>> (1) First, is this a surprising result or is the 2x difference
>>>> actually expected behavior?
>>>
>>> Maybe; there are lots of locks in one process. For instance, glibc's
>>> malloc has locking, so any memory allocation anywhere in the
>>> application's processing path will cause lock contention. The issue
>>> may have nothing to do with RDMA.
>>
>> There are no mallocs in the critical path of the benchmark. In the
>> one-process, multi-threaded case, the mallocs for resource creation
>> all happen before the OpenMP parallel region is created. Here's a
>> snapshot of the parallel region that contains the critical path:
>>
>> #pragma omp parallel
>> {
>>     int i = omp_get_thread_num(), k;
>>     int cqe_count = 0;
>>     int post_count = 0;
>>     int comp_count = 0;
>>     int posts = 0;
>>
>>     struct ibv_send_wr *bad_send_wqe;
>>     // qp_depth is 128 (adopted from perftest)
>>     struct ibv_wc *WC = (struct ibv_wc *) malloc(qp_depth * sizeof(struct ibv_wc));
>>
>>     #pragma omp single
>>     { // only one thread will execute this
>>         MPI_Barrier(MPI_COMM_WORLD);
>>     } // implicit barrier for the threads
>>     if (i == 0)
>>         t_start = MPI_Wtime();
>>
>>     /* Critical Path Start */
>>     // posts_per_qp = num_of_msgs / num_qps
>>     while (post_count < posts_per_qp || comp_count < posts_per_qp) {
>>         /* Post */
>>         posts = min((posts_per_qp - post_count),
>>                     (qp_depth - (post_count - comp_count)));
>>         for (k = 0; k < posts; k++)
>>             ret = ibv_post_send(qp[i], &send_wqe[i], &bad_send_wqe);
>>         post_count += posts;
>>         /* Poll */
>>         if (comp_count < posts_per_qp) {
>>             cqe_count = ibv_poll_cq(cq[i], num_comps, WC); // num_comps = qp_depth
>>             comp_count += cqe_count;
>>         }
>>     } /* Critical Path End */
>>     if (i == 0)
>>         t_end = MPI_Wtime();
>> }
>>
>>> There is also some locking inside the userspace mlx5 driver that may
>>> contend depending on how your process has set things up.
>>
>> I forgot to mention this earlier, but I collected the numbers with
>> MLX5_SINGLE_THREADED set, since none of the resources are shared
>> between the threads. So the userspace driver wasn't taking any locks.
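>>
>> (Concretely, the runs are launched along these lines, assuming the MPI
>> launcher forwards the environment to the ranks; the binary name here
>> is just a placeholder:
>>
>>     MLX5_SINGLE_THREADED=1 mpiexec -n 2 ./write_bench
>>
>> with, e.g., one rank on the sender node and one on the receiver.)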
>>
>>> The entire send path is in user space so there is no kernel component
>>> here.
>>
>> Yes, that's correct. My concern was that, during resource creation,
>> the kernel might be sharing some resource across a process's contexts,
>> or that some sort of multiplexing onto hardware contexts was happening
>> through control groups. Is it safe for me to conclude that separate,
>> independent contexts/bfregs are assigned when a process calls
>> ibv_open_device multiple times?
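>>
>> (To make the question concrete, by "calling ibv_open_device multiple
>> times" I mean roughly the following: one context per thread/QP, all
>> on the same device. This is a trimmed sketch with error handling
>> omitted, and NUM_CTXS is just a stand-in for the thread count.
>>
>>     #include <infiniband/verbs.h>
>>
>>     #define NUM_CTXS 16   // e.g., one per thread/QP
>>
>>     static void open_ctxs(struct ibv_context *ctx[NUM_CTXS])
>>     {
>>         int n;
>>         struct ibv_device **dev_list = ibv_get_device_list(&n);
>>
>>         for (int i = 0; i < NUM_CTXS; i++)
>>             ctx[i] = ibv_open_device(dev_list[0]); // same HCA, separate context
>>
>>         ibv_free_device_list(dev_list);
>>     }
>>
>> The question, restated: does each of those contexts get its own
>> independent bfregs, or can the kernel make them share because they
>> belong to the same process?)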
>>
>>> Jason
>>
>> Thanks,
>> Rohit Zambre

<<attachment: multi-threadVSproc.zip>>

