Hi,

I have been trying to pinpoint the cause of an mlx5 behavior/problem but haven't been able to yet. It would be great if you could share your thoughts or point me toward what I should be looking at.

I am running a simple sender-receiver micro-benchmark (like ib_write_bw) to measure the message rate of RDMA writes with an increasing number of endpoints. Each endpoint has its own independent resources: context, PD, QP, CQ, MR, and buffer; none of these resources are shared between endpoints. I run this benchmark for N endpoints in two ways: with multiple threads and with multiple processes. In the multi-threaded case, one process creates the N endpoints and all of their resources but uses N threads to drive them; each thread posts only to its own QP and polls only its own CQ. In the multi-process case, there are N processes; each process creates and drives exactly one endpoint and its resources. In both cases, ibv_open_device, ibv_alloc_pd, ibv_reg_mr, ibv_create_cq, and ibv_create_qp are each called N times.

My understanding (from reading the user-space driver code) is that a new struct ibv_context is allocated every time ibv_open_device is called, regardless of whether all the ibv_open_device calls come from the same process or from different processes. So, in both cases, there are N endpoints on the sender-node system, and the message rates should theoretically be the same. However, in the attached graph you will see that while both the multi-threaded and multi-process cases scale with the number of endpoints, there is a >2x difference between the two at 8 CTXs. The graph shows RDMA-write message rates for 2-byte messages.

I collected these numbers on the Thor cluster of the HPC Advisory Council. A Thor node has 16 cores on a socket, one ConnectX-4 card (with one active port: mlx5_0), RHEL 7.2, and kernel 3.10.0-327.el7.x86_64. Using the binding options of MPI and OpenMP, I have made sure that each process/thread is bound to its own core. I use MPI only to launch the processes and exchange connection information; all of the communication goes through the libibverbs API.

Since I wasn't able to find any relevant difference in the user-space code, I have been going through the kernel code to find the cause of this behavior. While I haven't been able to pinpoint anything specific, I have noted that the current struct is used in ib_umem_get, which is called by mmap, ibv_reg_mr, and ibv_poll_cq. I am currently studying these paths, but I am not sure whether I am looking in the right direction.

Here are my questions:
(1) Is this a surprising result, or is the >2x difference actually expected behavior?
(2) Could you point me to places in the kernel code that I should be studying to understand the cause of this behavior? Or do you have suggestions for experiments I could run to eliminate potential causes?

If you would like me to share my micro-benchmark code so you can reproduce the results, let me know. A minimal sketch of the per-endpoint setup and driving loop is appended below my signature.

Thank you,
Rohit Zambre
Ph.D. Student, Computer Engineering
University of California, Irvine
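
P.S. To make the setup concrete, here is a minimal, simplified sketch of what each endpoint allocates and how it is driven. This is not my actual benchmark code: the endpoint struct, the ep_create/ep_drive names, the buffer size, and the queue depths are illustrative, and the QP state transitions, the MPI exchange of connection information, and the pipelining of outstanding writes are omitted for brevity.

#include <stdint.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

#define BUF_SIZE 4096   /* illustrative buffer size */

/* One fully independent endpoint: nothing below is shared across endpoints. */
struct endpoint {
    struct ibv_context *ctx;
    struct ibv_pd      *pd;
    struct ibv_mr      *mr;
    struct ibv_cq      *cq;
    struct ibv_qp      *qp;
    void               *buf;
};

/* Create one endpoint on the first device (mlx5_0 in my runs).
 * A separate ibv_open_device per endpoint gives each its own ibv_context. */
static int ep_create(struct endpoint *ep)
{
    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    if (!dev_list || !dev_list[0])
        return -1;

    ep->ctx = ibv_open_device(dev_list[0]);
    ibv_free_device_list(dev_list);
    if (!ep->ctx)
        return -1;

    ep->pd  = ibv_alloc_pd(ep->ctx);
    ep->buf = calloc(1, BUF_SIZE);
    ep->mr  = ibv_reg_mr(ep->pd, ep->buf, BUF_SIZE,
                         IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    ep->cq  = ibv_create_cq(ep->ctx, 256, NULL, NULL, 0);

    struct ibv_qp_init_attr qp_attr = {
        .send_cq = ep->cq,   /* private CQ, polled only by this endpoint's driver */
        .recv_cq = ep->cq,
        .cap     = { .max_send_wr = 256, .max_recv_wr = 1,
                     .max_send_sge = 1,  .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    ep->qp = ibv_create_qp(ep->pd, &qp_attr);

    /* INIT->RTR->RTS transitions and the connection-information exchange
     * (done over MPI in my benchmark) are omitted here. */
    return (ep->pd && ep->mr && ep->cq && ep->qp) ? 0 : -1;
}

/* Per-thread (or per-process) driver: posts signaled 2-byte RDMA writes on its
 * own QP and polls only its own CQ. (The real benchmark keeps a window of
 * outstanding writes instead of waiting for each completion.) */
static void ep_drive(struct endpoint *ep, uint64_t remote_addr, uint32_t rkey,
                     long iters)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)ep->buf,
        .length = 2,
        .lkey   = ep->mr->lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,
        .send_flags = IBV_SEND_SIGNALED,
        .wr.rdma.remote_addr = remote_addr,
        .wr.rdma.rkey        = rkey,
    };
    struct ibv_send_wr *bad_wr;
    struct ibv_wc wc;

    for (long i = 0; i < iters; i++) {
        if (ibv_post_send(ep->qp, &wr, &bad_wr))
            break;
        while (ibv_poll_cq(ep->cq, 1, &wc) == 0)
            ;   /* busy-poll this endpoint's CQ only */
    }
}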
Attachment:
write_mr_small_multiProcVSmultiThread_mlx5.pdf