Hi,

I have been trying to pinpoint the cause of an mlx5 behavior/problem but haven't been able to yet. It would be great if you could share your thoughts or point me toward what I should be looking at.

I am running a simple sender-receiver micro-benchmark (like ib_write_bw) to measure the message rate of RDMA writes with an increasing number of endpoints. Each endpoint has its own independent resources: context, PD, QP, CQ, MR, and buffer; none of these resources are shared between endpoints. I run this benchmark for N endpoints in two ways: with multiple threads and with multiple processes. In the multi-threaded case, one process creates the N endpoints and all of their resources but uses N threads to drive them; each thread posts only to its own QP and polls only its own CQ. In the multi-process case, there are N processes; each process creates and drives exactly one endpoint and its resources. In both cases, ibv_open_device, ibv_alloc_pd, ibv_reg_mr, ibv_create_cq, and ibv_create_qp are each called N times.

My understanding (from reading the user-space driver code) is that a new struct ibv_context is allocated every time ibv_open_device is called, regardless of whether all the ibv_open_device calls come from the same process or from different processes. So, in both cases, there are N endpoints on the sender-node system, and the message rates should theoretically be the same. However, in the attached graph you will see that while both the multi-threaded and multi-process cases scale with the number of endpoints, there is a >2x difference between the two at 8 CTXs. The graph shows RDMA-write message rates for 2-byte messages.

I collected these numbers on the Thor cluster of the HPC Advisory Council. A Thor node has 16 cores on a socket, one ConnectX-4 card (with one active port: mlx5_0), RHEL 7.2, and kernel 3.10.0-327.el7.x86_64. Using the binding options of MPI and OpenMP, I have made sure that each process/thread is bound to its own core. I use MPI only to launch the processes and exchange connection information; all of the communication goes through the libibverbs API.

Since I wasn't able to find any relevant difference in the user-space code, I have been going through the kernel code to find the cause of this behavior. While I haven't been able to pinpoint anything specific, I have noted that the current struct is used in ib_umem_get, which is called by mmap, ibv_reg_mr, and ibv_poll_cq. I am currently studying these paths, but I am not sure whether I am looking in the right direction.

Here are my questions:
(1) Is this a surprising result, or is the >2x difference actually expected behavior?
(2) Could you point me to places in the kernel code that I should be studying to understand the cause of this behavior? Or do you have suggestions for experiments I could run to eliminate potential causes?

If you would like me to share my micro-benchmark code so you can reproduce the results, let me know. A minimal sketch of the per-endpoint setup and driving loop is appended below my signature.

Thank you,
Rohit Zambre
Ph.D. Student, Computer Engineering
University of California, Irvine
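
P.S. To make the setup concrete, here is a minimal, simplified sketch of what each endpoint allocates and how it is driven. This is not my actual benchmark code: the endpoint struct, the ep_create/ep_drive names, the buffer size, and the queue depths are illustrative, and the QP state transitions, the MPI exchange of connection information, and the pipelining of outstanding writes are omitted for brevity.

#include <stdint.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

#define BUF_SIZE 4096   /* illustrative buffer size */

/* One fully independent endpoint: nothing below is shared across endpoints. */
struct endpoint {
    struct ibv_context *ctx;
    struct ibv_pd      *pd;
    struct ibv_mr      *mr;
    struct ibv_cq      *cq;
    struct ibv_qp      *qp;
    void               *buf;
};

/* Create one endpoint on the first device (mlx5_0 in my runs).
 * A separate ibv_open_device per endpoint gives each its own ibv_context. */
static int ep_create(struct endpoint *ep)
{
    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    if (!dev_list || !dev_list[0])
        return -1;

    ep->ctx = ibv_open_device(dev_list[0]);
    ibv_free_device_list(dev_list);
    if (!ep->ctx)
        return -1;

    ep->pd  = ibv_alloc_pd(ep->ctx);
    ep->buf = calloc(1, BUF_SIZE);
    ep->mr  = ibv_reg_mr(ep->pd, ep->buf, BUF_SIZE,
                         IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    ep->cq  = ibv_create_cq(ep->ctx, 256, NULL, NULL, 0);

    struct ibv_qp_init_attr qp_attr = {
        .send_cq = ep->cq,   /* private CQ, polled only by this endpoint's driver */
        .recv_cq = ep->cq,
        .cap     = { .max_send_wr = 256, .max_recv_wr = 1,
                     .max_send_sge = 1,  .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    ep->qp = ibv_create_qp(ep->pd, &qp_attr);

    /* INIT->RTR->RTS transitions and the connection-information exchange
     * (done over MPI in my benchmark) are omitted here. */
    return (ep->pd && ep->mr && ep->cq && ep->qp) ? 0 : -1;
}

/* Per-thread (or per-process) driver: posts signaled 2-byte RDMA writes on its
 * own QP and polls only its own CQ. (The real benchmark keeps a window of
 * outstanding writes instead of waiting for each completion.) */
static void ep_drive(struct endpoint *ep, uint64_t remote_addr, uint32_t rkey,
                     long iters)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)ep->buf,
        .length = 2,
        .lkey   = ep->mr->lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,
        .send_flags = IBV_SEND_SIGNALED,
        .wr.rdma.remote_addr = remote_addr,
        .wr.rdma.rkey        = rkey,
    };
    struct ibv_send_wr *bad_wr;
    struct ibv_wc wc;

    for (long i = 0; i < iters; i++) {
        if (ibv_post_send(ep->qp, &wr, &bad_wr))
            break;
        while (ibv_poll_cq(ep->cq, 1, &wc) == 0)
            ;   /* busy-poll this endpoint's CQ only */
    }
}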
Attachment:
write_mr_small_multiProcVSmultiThread_mlx5.pdf