Re: [RFC PATCH] verbs: Introduce mlx5: Implement uncontended independent communication paths

On 5/7/2018 7:26 AM, Rohit Zambre wrote:
On Sun, May 6, 2018 at 7:47 AM, Yishai Hadas <yishaih@xxxxxxxxxxxxxxxxxx> wrote:
On 5/4/2018 12:46 AM, Rohit Zambre wrote:

On Thu, May 3, 2018 at 3:15 PM, Alex Rosenbaum <rosenbaumalex@xxxxxxxxx>
wrote:

On Thu, May 3, 2018 at 6:19 PM, Rohit Zambre <rzambre@xxxxxxx> wrote:

An independent communication path is one that shares no hardware
resources with other communication paths. From a Verbs perspective, an
independent path is the one obtained by the first QP in a context. The
next QPs of the context may or may not share hardware resources amongst
themselves; the mapping of the resources to the QPs is
provider-specific. Sharing resources can hurt throughput in certain
cases. When only one thread uses the independent path, we term it an
uncontended independent path.

Today, the user has no way to request an independent path for an
arbitrary QP within a context. To create multiple independent paths,
the Verbs user must create multiple contexts with 1 QP per context.
However, this translates to significant hardware-resource wastage: 89%
in the case of the ConnectX-4 mlx5 device.

This RFC patch allows the user to request uncontended independent
communication paths in Verbs through an "independent" flag during
Thread Domain (TD) creation. The patch also provides a first-draft
implementation of uncontended independent paths in the mlx5 provider.

In mlx5, every even-odd pair of TDs shares the same UAR page, which is
not the case when the user creates multiple contexts with one TD per
context. When the user requests an independent TD, the driver will
dynamically allocate a new UAR page and map bfreg_0 of that UAR to the
TD. bfreg_1 of the UAR belonging to an independent TD is never used and
is essentially wasted. Hence, there must be a maximum number of
independent paths allowed within a context, since the hardware
resources are limited. This would be half of the maximum number of
dynamic UARs allowed per context.
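[Editorial sketch of how the proposed flag might look at the API level,
using the existing ibv_alloc_td() entry point. The comp_mask bit and
the `independent` field are hypothetical names for illustration; the
exact names are defined by the patch itself. An RDMA device and
libibverbs are required, so this is illustrative only:]

```c
#include <stdio.h>
#include <infiniband/verbs.h>

/* Allocate a TD that asks the provider for an uncontended independent
 * path, i.e. a dedicated UAR page in mlx5. Flag names are hypothetical. */
struct ibv_td *alloc_independent_td(struct ibv_context *ctx)
{
	struct ibv_td_init_attr attr = {
		.comp_mask = 0, /* RFC would add e.g. IBV_TD_INIT_ATTR_INDEPENDENT */
		/* .independent = 1,  hypothetical field proposed by the RFC */
	};
	struct ibv_td *td = ibv_alloc_td(ctx, &attr);

	if (!td)
		fprintf(stderr, "ibv_alloc_td failed (out of dynamic UARs?)\n");
	return td;
}
```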


I'm not sure I follow what you're trying to achieve here at the mlx5 HW
level.
Are you assuming that two threads with separate 'indep-comm-paths'
using separate bfregs on the same UAR page cause some contention and a
performance hit in the mlx5 HW?
We should first prove that's true, and then design a solution to solve
it.
Do you have benchmark results of any kind?


Yes, there is a ~20% drop in message rates when there are concurrent
BlueFlame writes to separate bfregs on the same UAR page.


Can you please share your test code, to help us make sure that you are
really referring to the above case in the analysis below?

I have attached my benchmark code. The critical path of interest is
lines 554-598. In the README, I have included an example of how to run
the benchmark. Let me know if you have any questions/concerns
regarding the benchmark code.

The graph attached reports message rates using rdma-core for 2-byte
RDMA-writes using 16 threads. Each thread is driving its own QP. Each
thread has its own CQ. Thread Domains are not used in this benchmark.


Can you try using TDs in your test and see if you get the same results
before your initial patch? That mode cleanly guarantees a 1<->1 mapping
of a UAR bfreg to a QP.
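[For reference, a minimal sketch of the TD mode being suggested,
against the stock (unpatched) rdma-core API: each QP is created on its
own TD through a parent domain, which is what pins one bfreg per QP.
Error handling is elided and an RDMA device is required, so treat this
as illustrative only:]

```c
#include <infiniband/verbs.h>

/* Create a QP bound to its own TD via a parent domain, so the provider
 * dedicates one UAR bfreg to this QP. */
struct ibv_qp *create_qp_with_own_td(struct ibv_context *ctx,
				     struct ibv_pd *pd,
				     struct ibv_qp_init_attr *qp_attr)
{
	struct ibv_td_init_attr td_attr = { .comp_mask = 0 };
	struct ibv_td *td = ibv_alloc_td(ctx, &td_attr);

	struct ibv_parent_domain_init_attr pd_attr = {
		.pd = pd,
		.td = td,
		.comp_mask = 0,
	};
	struct ibv_pd *parent_pd = ibv_alloc_parent_domain(ctx, &pd_attr);

	/* QPs created on parent_pd inherit the TD's dedicated bfreg */
	return ibv_create_qp(parent_pd, qp_attr);
}
```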

I will most likely have the numbers with TDs in the second half of the
week. I will report them here then.


Yes, please share your results with TD usage, with and without your patch; this may help clarify the issue.

The x-axis is the ratio of #QPs:#CTXs. For example, 2-way CTX-sharing
means there are 8 CTXs with 2 QPs each. "wBF" means "with BlueFlame"
and "woBF" means without (by setting MLX5_SHUT_UP_BF=1). "wPostlist2"
means the size of the linked-list of WQEs is 2.


Looking at your graph, the best results are wPostlist2-wBF, correct?
But in that case we don't expect BF at all, only DB, as you wrote
below. Can you please clarify the test and the results that are
represented here?

Yes, correct. With postlist, rdma-core doesn't use BF, just DB. I
included the various lines to show differences in behavior. The
semantics of Verbs-users may or may not allow the use of features such
as postlist.


So what in the graph refers to your initial patch's improvements? The green line was a DB test, not a BF result.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
