Re: [RFC PATCH] verbs: Introduce mlx5: Implement uncontended independent communication paths

On Mon, May 7, 2018 at 10:24 AM, Yishai Hadas
<yishaih@xxxxxxxxxxxxxxxxxx> wrote:
> On 5/7/2018 7:26 AM, Rohit Zambre wrote:
>>
>> On Sun, May 6, 2018 at 7:47 AM, Yishai Hadas <yishaih@xxxxxxxxxxxxxxxxxx>
>> wrote:
>>>
>>> On 5/4/2018 12:46 AM, Rohit Zambre wrote:
>>>>
>>>>
>>>> On Thu, May 3, 2018 at 3:15 PM, Alex Rosenbaum <rosenbaumalex@xxxxxxxxx>
>>>> wrote:
>>>>>
>>>>>
>>>>> On Thu, May 3, 2018 at 6:19 PM, Rohit Zambre <rzambre@xxxxxxx> wrote:
>>>>>>
>>>>>>
>>>>>> An independent communication path is one that shares no hardware
>>>>>> resources with other communication paths. From a Verbs perspective,
>>>>>> an independent path is the one obtained by the first QP in a
>>>>>> context. The next QPs of the context may or may not share hardware
>>>>>> resources amongst themselves; the mapping of the resources to the
>>>>>> QPs is provider-specific. Sharing resources can hurt throughput in
>>>>>> certain cases. When only one thread uses the independent path, we
>>>>>> term it an uncontended independent path.
>>>>>>
>>>>>> Today, the user has no way to request an independent path for an
>>>>>> arbitrary QP within a context. To create multiple independent
>>>>>> paths, the Verbs user must create multiple contexts with 1 QP per
>>>>>> context. However, this translates to significant hardware-resource
>>>>>> wastage: 89% in the case of the ConnectX-4 mlx5 device.
>>>>>>
>>>>>> This RFC patch allows the user to request uncontended independent
>>>>>> communication paths in Verbs through an "independent" flag during
>>>>>> Thread Domain (TD) creation. The patch also provides a first-draft
>>>>>> implementation of uncontended independent paths in the mlx5
>>>>>> provider.
>>>>>>
>>>>>> In mlx5, every even-odd pair of TDs shares the same UAR page, which
>>>>>> is not the case when the user creates multiple contexts with one TD
>>>>>> per context. When the user requests an independent TD, the driver
>>>>>> will dynamically allocate a new UAR page and map bfreg_0 of that
>>>>>> UAR to the TD. bfreg_1 of the UAR belonging to an independent TD is
>>>>>> never used and is essentially wasted. Hence, there must be a
>>>>>> maximum number of independent paths allowed within a context, since
>>>>>> the hardware resources are limited. This would be half of the
>>>>>> maximum number of dynamic UARs allowed per context.
>>>>>
>>>>>
>>>>>
>>>>> I'm not sure I follow what you're trying to achieve here on the
>>>>> mlx5 HW level.
>>>>> Are you assuming that two threads with separate 'indep-comm-paths'
>>>>> using separate bfregs on the same UAR page cause some contention and
>>>>> a performance hit in the mlx5 HW?
>>>>> We should first prove that's true, and then design a solution to
>>>>> solve it.
>>>>> Do you have benchmark results of any kind?
>>>>
>>>>
>>>>
>>>> Yes, there is a ~20% drop in message rates when there are concurrent
>>>> BlueFlame writes to separate bfregs on the same UAR page.
>>>
>>>
>>>
>>> Can you please share your test code, to help us make sure that you
>>> are really referring to the above case with the analysis below?
>>
>>
>> I have attached my benchmark code. The critical path of interest is
>> lines 554-598. In the README, I have included an example of how to run
>> the benchmark. Let me know if you have any questions/concerns
>> regarding the benchmark code.
>>
>>>> The attached graph reports message rates using rdma-core for 2-byte
>>>> RDMA writes with 16 threads. Each thread is driving its own QP. Each
>>>> thread has its own CQ. Thread Domains are not used in this benchmark.
>>>
>>>
>>>
>>> Can you try using TDs in your test and see whether you get the same
>>> results before your initial patch? This mode cleanly guarantees the
>>> 1<->1 mapping of a UAR bfreg to a QP.
>>
>>
>> I will most likely have the numbers with TDs in the second half of the
>> week. I will report them here then.
>>
>
> Yes, please share your results with TD usage, both with and without
> your patch; this may help clarify the issue.
>
>>>> The x-axis is the ratio of #QPs:#CTXs. For example, 2-way CTX-sharing
>>>> means there are 8 CTXs with 2 QPs each. "wBF" means "with BlueFlame"
>>>> and "woBF" means without (by setting MLX5_SHUT_UP_BF=1). "wPostlist2"
>>>> means the size of the linked-list of WQEs is 2.
>>>
>>>
>>>
>>> Looking at your graph, the best results are wPostlist2-wBF, correct?
>>> But in that case we don't expect BF at all, but DB, as you wrote
>>> below. Can you please clarify the test and the results that are
>>> represented here?
>>
>>
>> Yes, correct. With postlist, rdma-core doesn't use BF, just DB. I
>> included the various lines to show differences in behavior. The
>> semantics of Verbs users' applications may or may not allow the use of
>> features such as postlist.
>>
>
> So what in the graph refers to your initial patch's improvements? The
> green line was a DB test, not a BF result.

The green line is without Postlist. So, the number of WQEs per
ibv_post_send is 1. In this case, rdma-core uses BF, not DB. The graph
doesn't show improvements from my patches; I'm just showing the
current behavior under different scenarios: "wPostlist2-wBF" means DB
is used on WC pages; "woPostlistwBF" means BF is used on WC pages;
"woPostlistwoBF" means sending 1 WQE per ibv_post_send on UC pages.
Hope this is clearer. Please let me know if you have more
clarification questions.
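
For reference, the TD runs will associate each QP with its own TD through
a parent domain; the standard verbs sequence is roughly the following
(error handling omitted; ctx and qp_init_attr are assumed to exist). With
the RFC patch applied, the only addition would be setting the proposed
"independent" flag at TD-creation time; the flag name in the comment
below is a placeholder, not the actual field from the patch:

  struct ibv_td_init_attr td_attr = { .comp_mask = 0 };
  /* With the RFC patch, roughly:
   *   td_attr.independent = 1;   <-- placeholder for the proposed flag
   */
  struct ibv_td *td = ibv_alloc_td(ctx, &td_attr);

  struct ibv_pd *pd = ibv_alloc_pd(ctx);
  struct ibv_parent_domain_init_attr pd_attr = {
          .pd = pd,
          .td = td,
          .comp_mask = 0,
  };
  struct ibv_pd *parent_pd = ibv_alloc_parent_domain(ctx, &pd_attr);

  /* QPs created on parent_pd are tied to this TD; per your note above,
   * that guarantees a 1<->1 UAR bfreg to QP mapping in mlx5. */
  struct ibv_qp *qp = ibv_create_qp(parent_pd, &qp_init_attr);
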

-Rohit


