On Sun, May 6, 2018 at 7:47 AM, Yishai Hadas <yishaih@xxxxxxxxxxxxxxxxxx> wrote:
> On 5/4/2018 12:46 AM, Rohit Zambre wrote:
>>
>> On Thu, May 3, 2018 at 3:15 PM, Alex Rosenbaum <rosenbaumalex@xxxxxxxxx> wrote:
>>>
>>> On Thu, May 3, 2018 at 6:19 PM, Rohit Zambre <rzambre@xxxxxxx> wrote:
>>>>
>>>> An independent communication path is one that shares no hardware
>>>> resources with other communication paths. From a Verbs perspective, an
>>>> independent path is the one obtained by the first QP in a context. The
>>>> next QPs of the context may or may not share hardware resources amongst
>>>> themselves; the mapping of the resources to the QPs is provider-specific.
>>>> Sharing resources can hurt throughput in certain cases. When only one
>>>> thread uses the independent path, we term it an uncontended independent
>>>> path.
>>>>
>>>> Today, the user has no way to request an independent path for an
>>>> arbitrary QP within a context. To create multiple independent paths, the
>>>> Verbs user must create multiple contexts with 1 QP per context. However,
>>>> this translates to significant hardware-resource wastage: 89% in the
>>>> case of the ConnectX-4 mlx5 device.
>>>>
>>>> This RFC patch allows the user to request uncontended independent
>>>> communication paths in Verbs through an "independent" flag during Thread
>>>> Domain (TD) creation. The patch also provides a first-draft
>>>> implementation of uncontended independent paths in the mlx5 provider.
>>>>
>>>> In mlx5, every even-odd pair of TDs shares the same UAR page, which is
>>>> not the case when the user creates multiple contexts with one TD per
>>>> context. When the user requests an independent TD, the driver will
>>>> dynamically allocate a new UAR page and map bfreg_0 of that UAR to the
>>>> TD. bfreg_1 of the UAR belonging to an independent TD is never used and
>>>> is essentially wasted. Hence, there must be a maximum number of
>>>> independent paths allowed within a context since the hardware resources
>>>> are limited. This would be half of the maximum number of dynamic UARs
>>>> allowed per context.
>>>
>>> I'm not sure I follow what you're trying to achieve here on the mlx5 HW
>>> level.
>>> Are you assuming that two threads with separate 'indep-comm-paths' using
>>> separate bfregs on the same UAR page cause some contention and a
>>> performance hit in the mlx5 HW?
>>> We should first prove that's true, and then design a solution to solve it.
>>> Do you have benchmark results of any kind?
>>
>> Yes, there is a ~20% drop in message rates when there are concurrent
>> BlueFlame writes to separate bfregs on the same UAR page.
>
> Can you please share your test code to help us make sure that you are
> really referring to the above case with the below analysis?

I have attached my benchmark code. The critical path of interest is lines
554-598. In the README, I have included an example of how to run the
benchmark. Let me know if you have any questions or concerns regarding the
benchmark code.

>> The graph attached reports message rates using rdma-core for 2-byte
>> RDMA-writes using 16 threads. Each thread is driving its own QP. Each
>> thread has its own CQ. Thread Domains are not used in this benchmark.
>
> Can you try to use TDs in your test and see if you get the same results
> before your initial patch? This mode cleanly guarantees a 1<->1 mapping of
> a UAR bfreg to a QP.

I will most likely have the numbers with TDs in the second half of the week.
I will report them here then.
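In case it helps, this is roughly the per-thread flow I plan to use for the
TD runs. It is just a minimal sketch against the stock rdma-core
TD/parent-domain API; the rest of the setup and the error handling from the
attached benchmark are omitted:

#include <infiniband/verbs.h>

/* Per-thread setup: bind a QP to its own Thread Domain through a parent
 * domain, so that mlx5 gives the TD a dedicated bfreg and can drop the
 * send-path locks. ctx/pd/cq come from the usual benchmark setup. */
static struct ibv_qp *create_td_qp(struct ibv_context *ctx,
                                   struct ibv_pd *pd, struct ibv_cq *cq)
{
        struct ibv_td_init_attr td_attr = { .comp_mask = 0 };
        struct ibv_td *td = ibv_alloc_td(ctx, &td_attr);
        if (!td)
                return NULL;

        struct ibv_parent_domain_init_attr pd_attr = {
                .pd = pd,
                .td = td,
                .comp_mask = 0,
        };
        struct ibv_pd *parent_pd = ibv_alloc_parent_domain(ctx, &pd_attr);
        if (!parent_pd)
                return NULL;

        struct ibv_qp_init_attr qp_attr = {
                .send_cq = cq,
                .recv_cq = cq,
                .cap = {
                        .max_send_wr  = 128,
                        .max_recv_wr  = 128,
                        .max_send_sge = 1,
                        .max_recv_sge = 1,
                },
                .qp_type = IBV_QPT_RC,
        };
        /* Creating the QP on the parent domain (instead of pd) ties it
         * to the TD and hence to the TD's bfreg. */
        return ibv_create_qp(parent_pd, &qp_attr);
}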
>> The x-axis is the ratio of #QPs:#CTXs. For example, 2-way CTX-sharing
>> means there are 8 CTXs with 2 QPs each. "wBF" means "with BlueFlame"
>> and "woBF" means without (by setting MLX5_SHUT_UP_BF=1). "wPostlist2"
>> means the size of the linked list of WQEs is 2.
>
> Looking at your graph, the best results are wPostlist2-wBF, correct? But
> in that case we don't expect BF at all, just DB, as you wrote below. Can
> you please clarify the test and the results that are represented here?

Yes, correct. With postlist, rdma-core doesn't use BF, just DB. I included
the various lines to show the differences in behavior. The semantics of
Verbs users may or may not allow the use of features such as postlist.

>> "woPostlist" means each thread is posting only 1 WQE per ibv_post_send.
>> These numbers are on a ConnectX-4 mlx5 device (on the Gomez machine of
>> JLSE). The numbers are the same on the ConnectX-4 device on the Thor
>> cluster of the HPC Advisory Council. The behavior with MOFED is the same
>> with slight differences in absolute numbers; the drop is ~15%.
>>
>> The first drop in the green line is due to concurrent BlueFlame writes
>> on the same UAR page. The second drop is due to bfreg lock contention
>> between the 5th and the 16th QP. With a postlist size greater than 1,
>> rdma-core does only 64-bit DoorBells. Concurrent DoorBells don't hurt.
>> Concurrent BlueFlame writes do. What exactly is causing this, I am not
>> sure. But from some more experimenting, I think the answer lies in how
>> the NIC finds out whether to fetch the WQE from the BlueFlame buffer or
>> DMA-read it from memory. I wasn't able to find a "bit" that was set
>> during WQE preparation that tells the NIC where to read from. But it
>> could be something else entirely.
>>
>> We are addressing the green line with this patch.
>>
>>> When you create two separate ibv_contexts, you separate a lot more than
>>> just the UAR pages on which the bfregs are mapped. The entire software
>>> locking scheme is separated.
>>
>> Right. In the description, I wanted to emphasize the independent-path
>> aspect of different contexts since that is most important to the MPI
>> library. The locking can be controlled through Thread Domains.
>>
>>> The ibv_td object allows the user to separate resources so that locks
>>> can be managed in a smarter way in the provider lib data fast path.
>>> For that we allocate a bfreg for each ibv_td obj. Using a dedicated
>>> bfreg allows lower-latency sends, as the doorbell does not need a lock
>>> to write the even/odd entries.
>>> At the time we did not extend the work to cover additional locks in
>>> mlx5, but it seems your series is targeting something else.
>>
>> If you are referring to [1], then that patch targets just disabling the
>> QP lock when a Thread Domain is specified. To create an independent
>> software path, the MPI library will use the Thread Domain.
>>
>> [1] https://patchwork.kernel.org/patch/10367419/
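To make the intended usage by the MPI library a bit more concrete, below is
a rough sketch of how the library side could request an independent TD with
the proposed flag. The comp_mask bit and flag name here are only
placeholders for the idea in the RFC, not the exact interface of the patch:

#include <infiniband/verbs.h>

/* Illustrative only: IBV_TD_INIT_ATTR_FLAGS, the flags field, and
 * IBV_TD_INDEPENDENT are placeholders for what the RFC proposes. The
 * idea is that the MPI library asks, at TD-creation time, for a TD that
 * does not share a UAR page (and hence a BlueFlame register) with any
 * other TD of the same context. */
static struct ibv_td *alloc_independent_td(struct ibv_context *ctx)
{
        struct ibv_td_init_attr td_attr = {
                .comp_mask = IBV_TD_INIT_ATTR_FLAGS,  /* placeholder */
                .flags     = IBV_TD_INDEPENDENT,      /* placeholder */
        };
        struct ibv_td *td = ibv_alloc_td(ctx, &td_attr);

        if (!td) {
                /* For example, the per-context cap on dynamic UARs (and
                 * thus on independent TDs) has been hit; fall back to a
                 * regular TD that may share a UAR page. */
                struct ibv_td_init_attr fallback = { .comp_mask = 0 };
                td = ibv_alloc_td(ctx, &fallback);
        }
        return td;
}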
<<attachment: bench.zip>>