On Thu, May 3, 2018 at 3:51 PM, Alex Rosenbaum <rosenbaumalex@xxxxxxxxx> wrote:
> Rohit,
>
> Do you have additional benchmark results for each patch stage of the
> locking and bfreg changes you submitted?

I don't have incremental numbers yet. I will have the kernel with the
dynamic UAR allocation support only next week. I will report numbers then.

> How many CQs did you use? 1 per thread, or 1 per QP?

For the experiment in this thread, I used 1 "active" QP per thread. By
"active" I mean the QP the thread was using in method (3). In methods (1)
and (2), all QPs are "active" QPs. I used 1 CQ per thread, which
translates to 1 CQ per "active" QP.

-Rohit

> Alex
>
>
> On Wed, Apr 18, 2018 at 4:11 PM, Rohit Zambre <rzambre@xxxxxxx> wrote:
>> Hi,
>>
>> I am measuring 2-byte RDMA-write message rates with 16 threads from a
>> sender to a receiver. I am posting 1 message per ibv_post_send with
>> inlining, so mlx5_bf_copy() is used. I am measuring message rate in
>> two configurations: (1) 16 Contexts with 1 QP per Context; (2) 8
>> Contexts with 2 QPs per Context. The message rate in (2) is 15% lower
>> than that in (1), and I am trying to understand why this is the case.
>>
>> In (2), I learned that I can eliminate the 15% drop by creating 4 QPs
>> per Context but using only QP_0 and QP_2 within each Context. Yes,
>> this is hacky, but the purpose is to understand the behavior. This is
>> method (3).
>>
>> The difference between (2) and (3) is that the QPs being used in (3)
>> are on different UAR pages, as in (1). In (2), the QPs are on the
>> same UAR page.
>>
>> The number of sfence barriers is the same in all cases. In (2), the
>> threads are calling sfence on memory that lies on the same UAR page,
>> while in (1)/(3) they are calling sfence on memory that lies on
>> different UAR pages. mlx5_bf_copy() writes 64 bytes, the size of a
>> WC buffer.
>>
>> One theory to explain the 15% drop is that there is only 1 WC buffer
>> per UAR page: since WC buffers maintain state like caches do, if the
>> 1 WC buffer is being flushed, it cannot be modified by another thread
>> writing to the same UAR page. So each thread's mlx5_bf_copy() in (2)
>> is serialized by the sfence flush. But my understanding is that
>> multiple WC buffers exist per core, and I am not sure which system
>> layer maps WC buffers to pages. Could someone confirm the number of
>> WC buffers per UAR page, or point me to where I should be looking to
>> find out?
>>
>> Thanks,
>> -Rohit Zambre
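
P.S. For anyone who wants to reproduce the setup in the quoted mail, below
is a minimal sketch of the per-Context resource creation in methods
(1)-(3). It is illustrative, not the actual benchmark code: the CQ depth,
QP capabilities, and the setup_ctx()/thread_res names are my own
placeholders, and error handling is omitted.

    #include <infiniband/verbs.h>

    struct thread_res {
        struct ibv_context *ctx;
        struct ibv_cq      *cq;
        struct ibv_qp      *qp;   /* the one "active" QP this thread drives */
    };

    /* method (1): qps_per_ctx = 1, active QP index  {0}   -> 16 Contexts
     * method (2): qps_per_ctx = 2, active QP indices {0,1} ->  8 Contexts
     * method (3): qps_per_ctx = 4, active QP indices {0,2} ->  8 Contexts;
     *             QP_1 and QP_3 are created only so that QP_0 and QP_2
     *             land on different UAR pages. */
    static void setup_ctx(struct ibv_device *dev, int qps_per_ctx,
                          const int *active_idx, int threads_per_ctx,
                          struct thread_res *res)
    {
        struct ibv_context *ctx = ibv_open_device(dev);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);
        struct ibv_qp *qp[4];

        for (int i = 0; i < qps_per_ctx; i++) {
            /* 1 CQ per QP; only the CQs of "active" QPs get polled */
            struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);
            struct ibv_qp_init_attr attr = {
                .send_cq = cq,
                .recv_cq = cq,
                .cap = {
                    .max_send_wr     = 256,
                    .max_recv_wr     = 1,
                    .max_send_sge    = 1,
                    .max_recv_sge    = 1,
                    .max_inline_data = 2,   /* 2-byte inlined payload */
                },
                .qp_type = IBV_QPT_RC,
            };
            qp[i] = ibv_create_qp(pd, &attr);
        }

        for (int t = 0; t < threads_per_ctx; t++) {
            res[t].ctx = ctx;
            res[t].qp  = qp[active_idx[t]];
            res[t].cq  = res[t].qp->send_cq;
        }
    }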
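
And to make the serialization theory concrete, here is a simplified view
of what the inline post boils down to on the provider side. This is
paraphrased from my reading of the mlx5 provider, not the literal
rdma-core source; on x86 the WC flush compiles down to an sfence.

    #include <stdint.h>

    /* Simplified sketch of the BlueFlame doorbell on x86 (illustrative;
     * the real path is mlx5_bf_copy() plus the WC flush in the provider's
     * post-send code). Two QPs that share a UAR page have their BF
     * registers on the same WC-mapped page. */
    static void bf_doorbell(volatile uint64_t *bf_reg, /* WC-mapped BF register */
                            const uint64_t *wqe)       /* 64-byte inline WQE */
    {
        /* Copy the 64-byte WQE -- one WC buffer's worth -- into the
         * BlueFlame register. */
        for (int i = 0; i < 8; i++)
            bf_reg[i] = wqe[i];

        /* Flush the WC buffer so the device sees all 64 bytes at once.
         * The theory above: if only 1 WC buffer can be in flight per UAR
         * page, a concurrent bf_doorbell() on the other BF register of
         * the same page stalls until this flush completes, serializing
         * the two threads in method (2). */
        __asm__ __volatile__("sfence" ::: "memory");
    }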