Hi,

I am measuring 2-byte RDMA-write message rates with 16 threads sending from one node to a receiver. I post one message per ibv_post_send() call with inlining, so mlx5_bf_copy() is used. I am measuring the message rate with two configurations: (1) 16 contexts with 1 QP per context; (2) 8 contexts with 2 QPs per context. The message rate in (2) is 15% lower than in (1), and I am trying to understand why.

With (2), I learned that I can eliminate the 15% drop by creating 4 QPs per context but using only QP_0 and QP_2 within each context. Yes, this is hacky, but the purpose is to understand the behavior. Call this configuration (3). The difference between (2) and (3) is that the QPs being used in (3) lie on different UAR pages, as in (1), whereas in (2) the two QPs lie on the same UAR page.

The number of sfence barriers is the same in all cases. In (2), the threads issue sfence against memory that lies on the same UAR page, while in (1) and (3) they issue sfence against memory that lies on different UAR pages. mlx5_bf_copy() writes 64 bytes, the size of a WC buffer.

One theory to explain the 15% drop is that there is only one WC buffer per UAR page: since WC buffers maintain state much like caches, if that one WC buffer is being flushed, it cannot be modified by the other thread writing to the same UAR page, so in (2) each thread's mlx5_bf_copy() is serialized by the sfence flush. But my understanding is that multiple WC buffers exist per core, and I am not sure which layer of the system maps WC buffers to pages. Could someone confirm the number of WC buffers per UAR page, or point me to where I should be looking to find out?

For concreteness, rough sketches of the QP setup, the posting path, and my mental model of the BlueFlame path follow below my signature.

Thanks,
-Rohit Zambre
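
In case the setup details matter, here is a sketch of how the QPs are created per context in each configuration. create_qps(), the queue depths, and the max_inline_data value are placeholders from my benchmark, not anything from the mlx5 provider; error handling and the connect/RTS transitions are omitted.

#include <infiniband/verbs.h>

/* Create 'nqp' QPs on one context's PD/CQ.
 * (1): 16 contexts, nqp = 1, each thread drives qps[0].
 * (2):  8 contexts, nqp = 2, the two threads drive qps[0] and qps[1].
 * (3):  8 contexts, nqp = 4, the two threads drive only qps[0] and
 *       qps[2]; qps[1] and qps[3] are never used and exist only to
 *       push the used QPs onto different UAR pages. */
static void create_qps(struct ibv_pd *pd, struct ibv_cq *cq,
                       int nqp, struct ibv_qp **qps)
{
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = {
            .max_send_wr     = 128,   /* placeholder depth */
            .max_recv_wr     = 1,
            .max_send_sge    = 1,
            .max_recv_sge    = 1,
            .max_inline_data = 64,    /* enough for the 2-byte payload */
        },
        .qp_type = IBV_QPT_RC,
    };
    int i;

    for (i = 0; i < nqp; i++)
        qps[i] = ibv_create_qp(pd, &attr);  /* error checking omitted */
}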
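
Each thread's inner loop posts one work request per call, roughly like this. post_inline_write() is my placeholder name; the real loop also handles selective signaling and CQ polling, which I have left out here.

#include <stdint.h>
#include <infiniband/verbs.h>

/* Post one 2-byte RDMA write with IBV_SEND_INLINE so that the payload
 * is copied into the WQE and the provider takes the BlueFlame path
 * (mlx5_bf_copy()) rather than a plain doorbell. remote_addr and rkey
 * come from the receiver's registered buffer, exchanged out of band. */
static int post_inline_write(struct ibv_qp *qp, const void *payload,
                             uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)payload,
        .length = 2,
        /* lkey is ignored for inline data */
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,
        .send_flags = IBV_SEND_INLINE,
    };
    struct ibv_send_wr *bad_wr;

    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}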
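
Finally, the theory above, stated as a sketch of my mental model of the provider's inline path. This is only a conceptual illustration with made-up names (ring_blueflame, bf_reg, wqe), not the actual mlx5_bf_copy() code from rdma-core.

#include <stdint.h>

/* Conceptual illustration only -- not the actual rdma-core/mlx5 code.
 * bf_reg points at the 64-byte BlueFlame window in the write-combining
 * mapping of the QP's UAR page; wqe is the 64-byte WQE built in host
 * memory. */
static void ring_blueflame(volatile uint64_t *bf_reg, const uint64_t *wqe)
{
    int i;

    /* The mlx5_bf_copy() step: 64 bytes stored into the WC mapping,
     * where they sit in a write-combining buffer until flushed. */
    for (i = 0; i < 8; i++)
        bf_reg[i] = wqe[i];

    /* The flush (sfence on x86) pushes the WC buffer to the device as
     * one burst. My question is whether two threads doing this against
     * the same UAR page, as in configuration (2), contend for a single
     * WC buffer and therefore serialize here. */
    __asm__ volatile("sfence" ::: "memory");
}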