Re: 2x difference between multi-thread and multi-process for same number of CTXs

On Wed, Jan 24, 2018 at 11:08 AM, Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> On Wed, Jan 24, 2018 at 10:22:53AM -0600, Rohit Zambre wrote:
>
>> (1) First, is this a surprising result or is the 2x difference
>> actually expected behavior?
>
> Maybe, there are lots of locks in one process, for instance glibc's
> malloc has locking - so any memory allocation anywhere in the
> application's processing path will cause lock contention. The issue may
> have nothing to do with RDMA.

There are no mallocs in the critical path of the benchmark. In the
one-process, multi-threaded case, the mallocs for resource creation all
happen before the OpenMP parallel region is created. Here's a snapshot
of the parallel region that contains the critical path:

#pragma omp parallel
{
    int i = omp_get_thread_num(), k;
    int cqe_count = 0;
    int post_count = 0;
    int comp_count = 0;
    int posts = 0;

    struct ibv_send_wr *bad_send_wqe;
    struct ibv_wc *WC = (struct ibv_wc *) malloc(qp_depth * sizeof(struct ibv_wc)); // qp_depth is 128 (adopted from perftest)

    #pragma omp single
    { // only one thread will execute this
        MPI_Barrier(MPI_COMM_WORLD);
    } // implicit barrier for the threads
    if (i == 0)
        t_start = MPI_Wtime();

    /* Critical Path Start */
    while (post_count < posts_per_qp || comp_count < posts_per_qp) { // posts_per_qp = num_of_msgs / num_qps
        /* Post */
        posts = min((posts_per_qp - post_count), (qp_depth - (post_count - comp_count)));
        for (k = 0; k < posts; k++)
            ret = ibv_post_send(qp[i], &send_wqe[i], &bad_send_wqe);
        post_count += posts;
        /* Poll */
        if (comp_count < posts_per_qp) {
            cqe_count = ibv_poll_cq(cq[i], num_comps, WC); // num_comps = qp_depth
            comp_count += cqe_count;
        }
    } /* Critical Path End */
    if (i == 0)
        t_end = MPI_Wtime();
}

> There is also some locking inside the userspace mlx5 driver that may
> contend depending on how your process has set things up.

I missed mentioning this, but I collected the numbers with
MLX5_SINGLE_THREADED set, since none of the resources are shared
between the threads. So the userspace driver wasn't taking any locks.
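
For reference, the variable only needs to be in the environment before
the contexts are opened; a minimal sketch of that setup (assuming, as I
understand it, that the mlx5 provider checks the variable when the
context is opened):

    #include <stdlib.h>
    #include <infiniband/verbs.h>

    /* Set before any ibv_open_device() call so that (as I understand it)
       the mlx5 provider skips its internal locking for every context
       this process opens. */
    setenv("MLX5_SINGLE_THREADED", "1", 1);

    struct ibv_context *ctx = ibv_open_device(dev); /* dev from ibv_get_device_list() */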

> The entire send path is in user space so there is no kernel component
> here.

Yes, that's correct. My concern was that, during resource creation, the
kernel might be sharing some resource within a process, or that hardware
contexts were being multiplexed in some way (for example through control
groups). Is it safe to conclude that separate, independent
contexts/bfregs are assigned each time a process calls ibv_open_device?
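
To make the question concrete, the per-process setup is essentially the
following (the loop and array names here are illustrative, not the
benchmark's):

    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    struct ibv_context *ctx[NUM_CTXS]; /* NUM_CTXS = contexts per process */
    int j;

    for (j = 0; j < NUM_CTXS; j++)
        ctx[j] = ibv_open_device(dev_list[0]); /* same device, opened NUM_CTXS times */

The question is whether each ctx[j] is backed by its own hardware
context and bfregs, or whether the kernel ends up multiplexing them onto
shared ones.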

> Jason

Thanks,
Rohit Zambre


