Hi,

I am measuring 2-byte RDMA-write message rates with 16 threads sending from one node to a receiver. I post one message per ibv_post_send() call with inlining, so mlx5_bf_copy() is used. I am measuring the message rate with two configurations: (1) 16 contexts with 1 QP per context; (2) 8 contexts with 2 QPs per context. The message rate in (2) is 15% lower than in (1), and I am trying to understand why.

With (2), I learned that I can eliminate the 15% drop by creating 4 QPs per context but using only QP_0 and QP_2 within each context. Yes, this is hacky, but the purpose is to understand the behavior. Call this configuration (3). The difference between (2) and (3) is that the QPs being used in (3) lie on different UAR pages, as in (1), whereas in (2) the two QPs lie on the same UAR page.

The number of sfence barriers is the same in all cases. In (2), the threads issue sfence against memory that lies on the same UAR page, while in (1) and (3) they issue sfence against memory that lies on different UAR pages. mlx5_bf_copy() writes 64 bytes, the size of a WC buffer.

One theory to explain the 15% drop is that there is only one WC buffer per UAR page: since WC buffers maintain state much like caches, if that one WC buffer is being flushed, it cannot be modified by the other thread writing to the same UAR page, so in (2) each thread's mlx5_bf_copy() is serialized by the sfence flush. But my understanding is that multiple WC buffers exist per core, and I am not sure which layer of the system maps WC buffers to pages. Could someone confirm the number of WC buffers per UAR page, or point me to where I should be looking to find out?

For concreteness, rough sketches of the QP setup, the posting path, and my mental model of the BlueFlame path follow below my signature.

Thanks,
-Rohit Zambre
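
In case the setup details matter, here is a sketch of how the QPs are created per context in each configuration. create_qps(), the queue depths, and the max_inline_data value are placeholders from my benchmark, not anything from the mlx5 provider; error handling and the connect/RTS transitions are omitted.

#include <infiniband/verbs.h>

/* Create 'nqp' QPs on one context's PD/CQ.
 * (1): 16 contexts, nqp = 1, each thread drives qps[0].
 * (2):  8 contexts, nqp = 2, the two threads drive qps[0] and qps[1].
 * (3):  8 contexts, nqp = 4, the two threads drive only qps[0] and
 *       qps[2]; qps[1] and qps[3] are never used and exist only to
 *       push the used QPs onto different UAR pages. */
static void create_qps(struct ibv_pd *pd, struct ibv_cq *cq,
                       int nqp, struct ibv_qp **qps)
{
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = {
            .max_send_wr     = 128,   /* placeholder depth */
            .max_recv_wr     = 1,
            .max_send_sge    = 1,
            .max_recv_sge    = 1,
            .max_inline_data = 64,    /* enough for the 2-byte payload */
        },
        .qp_type = IBV_QPT_RC,
    };
    int i;

    for (i = 0; i < nqp; i++)
        qps[i] = ibv_create_qp(pd, &attr);  /* error checking omitted */
}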
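
Each thread's inner loop posts one work request per call, roughly like this. post_inline_write() is my placeholder name; the real loop also handles selective signaling and CQ polling, which I have left out here.

#include <stdint.h>
#include <infiniband/verbs.h>

/* Post one 2-byte RDMA write with IBV_SEND_INLINE so that the payload
 * is copied into the WQE and the provider takes the BlueFlame path
 * (mlx5_bf_copy()) rather than a plain doorbell. remote_addr and rkey
 * come from the receiver's registered buffer, exchanged out of band. */
static int post_inline_write(struct ibv_qp *qp, const void *payload,
                             uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)payload,
        .length = 2,
        /* lkey is ignored for inline data */
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,
        .send_flags = IBV_SEND_INLINE,
    };
    struct ibv_send_wr *bad_wr;

    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}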
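
Finally, the theory above, stated as a sketch of my mental model of the provider's inline path. This is only a conceptual illustration with made-up names (ring_blueflame, bf_reg, wqe), not the actual mlx5_bf_copy() code from rdma-core.

#include <stdint.h>

/* Conceptual illustration only -- not the actual rdma-core/mlx5 code.
 * bf_reg points at the 64-byte BlueFlame window in the write-combining
 * mapping of the QP's UAR page; wqe is the 64-byte WQE built in host
 * memory. */
static void ring_blueflame(volatile uint64_t *bf_reg, const uint64_t *wqe)
{
    int i;

    /* The mlx5_bf_copy() step: 64 bytes stored into the WC mapping,
     * where they sit in a write-combining buffer until flushed. */
    for (i = 0; i < 8; i++)
        bf_reg[i] = wqe[i];

    /* The flush (sfence on x86) pushes the WC buffer to the device as
     * one burst. My question is whether two threads doing this against
     * the same UAR page, as in configuration (2), contend for a single
     * WC buffer and therefore serialize here. */
    __asm__ volatile("sfence" ::: "memory");
}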