Re: [RFC PATCH] verbs: Introduce mlx5: Implement uncontended independent communication paths

On Thu, May 3, 2018 at 3:15 PM, Alex Rosenbaum <rosenbaumalex@xxxxxxxxx> wrote:
> On Thu, May 3, 2018 at 6:19 PM, Rohit Zambre <rzambre@xxxxxxx> wrote:
>> An independent communication path is one that shares no hardware resources
>> with other communication paths. From a Verbs perspective, an independent
>> path is the one obtained by the first QP in a context. The next QPs of the
>> context may or may not share hardware resources amongst themselves; the
>> mapping of the resources to the QPs is provider-specific. Sharing resources
>> can hurt throughput in certain cases. When only one thread uses the
>> independent path, we term it an uncontended independent path.
>>
>> Today, the user has no way to request an independent path for an
>> arbitrary QP within a context. To create multiple independent paths, the
>> Verbs user must create multiple contexts with 1 QP per context. However,
>> this translates to significant hardware-resource wastage: 89% in the case
>> of the ConnectX-4 mlx5 device.
>>
>> This RFC patch allows the user to request uncontended independent
>> communication paths in Verbs through an "independent" flag during Thread
>> Domain (TD) creation. The patch also provides a first-draft implementation
>> of uncontended independent paths in the mlx5 provider.
>>
>> In mlx5, every even-odd pair of TDs shares the same UAR page, which is not
>> the case when the user creates multiple contexts with one TD per context.
>> When the user requests an independent TD, the driver will dynamically
>> allocate a new UAR page and map bfreg_0 of that UAR to the TD. bfreg_1 of
>> the UAR belonging to an independent TD is never used and is essentially
>> wasted. Hence, there must be a maximum number of independent paths allowed
>> within a context since the hardware resources are limited. This would be
>> half of the maximum number of dynamic UARs allowed per context.
>
> I'm not sure I follow what you're trying to achieve here on the mlx5 HW level.
> Are you assuming that two threads with separate 'indep-comm-paths'
> using separate bfreg on the same UAR page causes some contention and
> performance hit in the mlx5 HW?
> We should first prove that's true, and then design a solution to solve it.
> Do you have benchmark results of any kind?

Yes, there is a ~20% drop in message rates when there are concurrent
BlueFlame writes to separate bfregs on the same UAR page.

The attached graph reports message rates with rdma-core for 2-byte
RDMA writes from 16 threads. Each thread drives its own QP and has its
own CQ. Thread Domains are not used in this benchmark.
The x-axis is the ratio of #QPs:#CTXs. For example, 2-way CTX-sharing
means there are 8 CTXs with 2 QPs each. "wBF" means "with BlueFlame"
and "woBF" means without (by setting MLX5_SHUT_UP_BF=1). "wPostlist2"
means the size of the linked-list of WQEs is 2. "woPostlist" means
each thread is posting only 1 WQE per ibv_post_send. These numbers are
on a ConnectX-4 mlx5 device (on the Gomez machine of JLSE). The
numbers are the same on the ConnectX-4 device on the Thor cluster of
the HPC Advisory Council. The behavior with MOFED is the same with
slight differences in absolute numbers; the drop is ~15%.
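
For concreteness, here is roughly what each benchmark thread does per call
in the "wPostlist2" configuration. This is only a minimal sketch: the QP,
MR, buffer, remote address and rkey are placeholders set up elsewhere, and
the real benchmark differs in details.

#include <string.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Sketch of one "wPostlist2" iteration: a single ibv_post_send() call
 * posting a linked list of two 2-byte RDMA-write WQEs. */
static int post_two_writes(struct ibv_qp *qp, struct ibv_mr *mr, void *buf,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge[2];
    struct ibv_send_wr wr[2], *bad_wr;
    int i;

    for (i = 0; i < 2; i++) {
        sge[i].addr   = (uintptr_t)buf;
        sge[i].length = 2;                /* 2-byte payload */
        sge[i].lkey   = mr->lkey;

        memset(&wr[i], 0, sizeof(wr[i]));
        wr[i].wr_id               = i;
        wr[i].sg_list             = &sge[i];
        wr[i].num_sge             = 1;
        wr[i].opcode              = IBV_WR_RDMA_WRITE;
        wr[i].wr.rdma.remote_addr = remote_addr;
        wr[i].wr.rdma.rkey        = rkey;
        wr[i].next                = (i == 0) ? &wr[1] : NULL;
    }
    /* Signal only the last WQE; completions are reaped from the
     * thread's private CQ elsewhere. */
    wr[1].send_flags = IBV_SEND_SIGNALED;

    return ibv_post_send(qp, &wr[0], &bad_wr);
}

In the "woPostlist" case the list has a single WQE per ibv_post_send, which
is the case where BlueFlame writes come into play.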

The first drop in the green line is due to concurrent BlueFlame writes
on the same UAR page. The second drop is due to bfreg lock contention
between the 5th and the 16th QP. With a postlist size greater than 1,
rdma-core does only 64-bit DoorBells. Concurrent DoorBells don't hurt;
concurrent BlueFlame writes do. What exactly is causing this, I am not
sure, but from some more experimenting, I think the answer lies in how
the NIC finds out whether to fetch the WQE from the BlueFlame buffer
or DMA-read it from memory. I wasn't able to find a "bit" set during
WQE preparation that tells the NIC where to read from. But it could be
something else entirely...

We are addressing the green line with this patch.
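
To make the BlueFlame-vs-DoorBell distinction concrete, here is an
illustration of the two ways the provider can notify the NIC. These are
simplified, hypothetical helpers for explanation only, not the actual code
in providers/mlx5/qp.c; 'bf_window' stands for the mmap()ed bfreg inside a
UAR page.

#include <stdint.h>
#include <string.h>

/* BlueFlame: copy the entire WQE through the bfreg window so the NIC
 * does not have to DMA-read it from host memory. Concurrent copies to
 * the two bfregs of the same UAR page are where the green line drops. */
static void ring_blueflame(void *bf_window, const void *wqe, size_t wqe_bytes)
{
    memcpy(bf_window, wqe, wqe_bytes);  /* write-combining copy in reality */
}

/* DoorBell: a single 64-bit write of the WQE's control segment; the NIC
 * then DMA-reads the WQE from the send queue in host memory. Concurrent
 * DoorBells on the same UAR page do not hurt message rates. */
static void ring_doorbell(volatile uint64_t *db_reg, const void *ctrl_seg)
{
    uint64_t v;

    memcpy(&v, ctrl_seg, sizeof(v));
    *db_reg = v;
}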

> When you create two separate ibv_context objects you separate a lot more
> than just the UAR pages on which the bfregs are mapped. The entire
> software locking scheme is separated.

Right. In the description, I wanted to emphasize the independent path
aspect of different contexts since that is most important to the MPI
library. The locking can be controlled through Thread Domains.

> The ibv_td object allows the user to separate resources so that locks
> could be managed in a smarter way in the provider lib data fast path.
> For that we allocate a bfreg for each ibv_td obj. Using a dedicated
> bfreg allows lower-latency sends, as the doorbell does not need a lock
> to write the even/odd entries.
> At the time, we did not extend the work to cover additional locks in
> mlx5, but it seems your series is targeting something else.

If you are referring to [1], then that patch targets only disabling the
QP lock when a Thread Domain is specified. To create an independent
software path, the MPI library will use the Thread Domain.

[1] https://patchwork.kernel.org/patch/10367419/
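
For the MPI use case, the flow would look roughly like the sketch below.
The ibv_alloc_td()/ibv_alloc_parent_domain() calls are the existing
rdma-core API; the "independent" request proposed by this RFC is shown as
a commented-out attribute because its final name/ABI is still open.

#include <string.h>
#include <infiniband/verbs.h>

/* Sketch: one independent, lock-free path (TD + parent domain + QP). */
static struct ibv_qp *create_independent_qp(struct ibv_context *ctx,
                                            struct ibv_pd *pd,
                                            struct ibv_qp_init_attr *qp_attr)
{
    struct ibv_td_init_attr td_attr = {
        .comp_mask = 0,
        /* Proposed by this RFC (name illustrative, not final): ask the
         * provider for a TD that shares no hardware resources with other
         * TDs -- in mlx5, its own dynamically allocated UAR page. */
        /* .independent = 1, */
    };
    struct ibv_parent_domain_init_attr pd_attr;
    struct ibv_td *td;
    struct ibv_pd *parent_pd;

    td = ibv_alloc_td(ctx, &td_attr);
    if (!td)
        return NULL;    /* e.g. the context ran out of dynamic UARs */

    memset(&pd_attr, 0, sizeof(pd_attr));
    pd_attr.pd = pd;
    pd_attr.td = td;
    parent_pd = ibv_alloc_parent_domain(ctx, &pd_attr);
    if (!parent_pd) {
        ibv_dealloc_td(td);
        return NULL;
    }

    /* QPs created on the parent domain are associated with the TD, which
     * is what lets the provider skip the QP lock (per [1]) and, with this
     * RFC, would give the TD its own UAR page. */
    return ibv_create_qp(parent_pd, qp_attr);
}

An ibv_alloc_td() failure could then be the natural way to report that the
context has run out of independent paths, i.e. of dynamic UARs.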

Attachment: rdma_core_ctx_sharing.png
Description: PNG image

