On Mon, May 7, 2018 at 10:24 AM, Yishai Hadas <yishaih@xxxxxxxxxxxxxxxxxx> wrote:
> On 5/7/2018 7:26 AM, Rohit Zambre wrote:
>> On Sun, May 6, 2018 at 7:47 AM, Yishai Hadas <yishaih@xxxxxxxxxxxxxxxxxx> wrote:
>>> On 5/4/2018 12:46 AM, Rohit Zambre wrote:
>>>> On Thu, May 3, 2018 at 3:15 PM, Alex Rosenbaum <rosenbaumalex@xxxxxxxxx> wrote:
>>>>> On Thu, May 3, 2018 at 6:19 PM, Rohit Zambre <rzambre@xxxxxxx> wrote:
>>>>>> An independent communication path is one that shares no hardware
>>>>>> resources with other communication paths. From a Verbs perspective,
>>>>>> an independent path is the one obtained by the first QP in a context.
>>>>>> The next QPs of the context may or may not share hardware resources
>>>>>> amongst themselves; the mapping of the resources to the QPs is
>>>>>> provider-specific. Sharing resources can hurt throughput in certain
>>>>>> cases. When only one thread uses the independent path, we term it an
>>>>>> uncontended independent path.
>>>>>>
>>>>>> Today, the user has no way to request an independent path for an
>>>>>> arbitrary QP within a context. To create multiple independent paths,
>>>>>> the Verbs user must create multiple contexts with 1 QP per context.
>>>>>> However, this translates to significant hardware-resource wastage:
>>>>>> 89% in the case of the ConnectX-4 mlx5 device.
>>>>>>
>>>>>> This RFC patch allows the user to request uncontended independent
>>>>>> communication paths in Verbs through an "independent" flag during
>>>>>> Thread Domain (TD) creation. The patch also provides a first-draft
>>>>>> implementation of uncontended independent paths in the mlx5 provider.
>>>>>>
>>>>>> In mlx5, every even-odd pair of TDs shares the same UAR page, which
>>>>>> is not the case when the user creates multiple contexts with one TD
>>>>>> per context. When the user requests an independent TD, the driver
>>>>>> will dynamically allocate a new UAR page and map bfreg_0 of that UAR
>>>>>> to the TD. bfreg_1 of the UAR belonging to an independent TD is never
>>>>>> used and is essentially wasted. Hence, there must be a maximum number
>>>>>> of independent paths allowed within a context since the hardware
>>>>>> resources are limited. This would be half of the maximum number of
>>>>>> dynamic UARs allowed per context.
>>>>>
>>>>> I'm not sure I follow what you're trying to achieve here on the mlx5
>>>>> HW level. Are you assuming that two threads with separate
>>>>> 'indep-comm-paths' using separate bfregs on the same UAR page cause
>>>>> some contention and a performance hit in the mlx5 HW? We should first
>>>>> prove that's true, and then design a solution to solve it. Do you have
>>>>> benchmark results of any kind?
>>>>
>>>> Yes, there is a ~20% drop in message rates when there are concurrent
>>>> BlueFlame writes to separate bfregs on the same UAR page.
>>>
>>> Can you please share your test code to help us make sure that you are
>>> really referring to the above case with the below analysis?
>>
>> I have attached my benchmark code. The critical path of interest is
>> lines 554-598. In the README, I have included an example of how to run
>> the benchmark. Let me know if you have any questions/concerns regarding
>> the benchmark code.

>>>> The graph attached reports message rates using rdma-core for 2-byte
>>>> RDMA-writes using 16 threads.
>>>> Each thread is driving its own QP. Each thread has its own CQ. Thread
>>>> Domains are not used in this benchmark.
>>>
>>> Can you try to use TDs in your test and see if you get the same results
>>> before your initial patch? This mode cleanly guarantees the 1<->1 UAR
>>> bfreg to a QP.
>>
>> I will most likely have the numbers with TDs in the second half of the
>> week. I will report them here then.
>
> Yes, please share your results with TD usage, with and without your
> patch; this may help clarify the issue.

>>>> The x-axis is the ratio of #QPs:#CTXs. For example, 2-way CTX-sharing
>>>> means there are 8 CTXs with 2 QPs each. "wBF" means "with BlueFlame"
>>>> and "woBF" means without (by setting MLX5_SHUT_UP_BF=1). "wPostlist2"
>>>> means the size of the linked list of WQEs is 2.
>>>
>>> Looking at your graph, the best results are wPostlist2-wBF, correct?
>>> But in that case we don't expect BF at all but DB, as you wrote below.
>>> Can you please clarify the test and the results that are represented
>>> here?
>>
>> Yes, correct. With postlist, rdma-core doesn't use BF, just DB. I
>> included the various lines to show differences in behavior. The
>> semantics of Verbs users may or may not allow the use of features such
>> as postlist.
>
> So what in the graph referred to your initial patch improvements? The
> green line was a DB test and not a BF result.

The green line is without Postlist, so the number of WQEs per
ibv_post_send is 1; in this case, rdma-core uses BF, not DB. The graph
doesn't show improvements from my patches; I'm just showing the current
behavior under different scenarios: "wPostlist2-wBF" means DB is used on
WC pages; "woPostlistwBF" means BF is used on WC pages; "woPostlistwoBF"
means sending 1 WQE per ibv_post_send on UC pages. Hope this is clearer.
Please let me know if you have more clarification questions. For
concreteness, I have appended two short sketches after my signature: one
of the two posting modes, and one of the per-QP TD setup I plan to use
for the runs you asked for.

-Rohit
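Sketch 1: the two posting modes behind the "wPostlist2" vs. "woPostlist"
lines. This is not the benchmark code itself; the helper name and its
parameters are illustrative, QP/buffer setup and error handling are
omitted, and only the shape of the ibv_post_send() call differs between
the two modes.

/*
 * Post 2-byte RDMA writes either as a postlist of 2 WQEs (one
 * ibv_post_send carrying a 2-entry WR chain) or as a single WQE per
 * ibv_post_send. Names and setup here are illustrative only.
 */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static int post_writes(struct ibv_qp *qp, void *buf, uint32_t lkey,
                       uint64_t raddr, uint32_t rkey, int postlist2)
{
    struct ibv_sge sge[2];
    struct ibv_send_wr wr[2], *bad_wr = NULL;

    memset(wr, 0, sizeof(wr));
    for (int i = 0; i < 2; i++) {
        sge[i].addr = (uintptr_t)buf;
        sge[i].length = 2;              /* 2-byte RDMA-write, as in the graph */
        sge[i].lkey = lkey;
        wr[i].sg_list = &sge[i];
        wr[i].num_sge = 1;
        wr[i].opcode = IBV_WR_RDMA_WRITE;
        wr[i].send_flags = IBV_SEND_SIGNALED;
        wr[i].wr.rdma.remote_addr = raddr;
        wr[i].wr.rdma.rkey = rkey;
    }

    if (postlist2) {
        /* "wPostlist2": chain 2 WQEs into one ibv_post_send; in this case
         * mlx5 rings the DB rather than doing a BlueFlame write. */
        wr[0].next = &wr[1];
        return ibv_post_send(qp, &wr[0], &bad_wr);
    }

    /* "woPostlist": 1 WQE per ibv_post_send; mlx5 uses a BF write here,
     * unless MLX5_SHUT_UP_BF=1 is set ("woBF"). */
    return ibv_post_send(qp, &wr[0], &bad_wr);
}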
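Sketch 2: roughly how I plan to attach each QP to its own Thread Domain
for the TD runs. Again, this is not the actual benchmark code; the helper
name, QP attributes, and error handling are illustrative/trimmed. Creating
the QP on a parent domain that carries a TD is what gives the QP its
dedicated bfreg, per the 1<->1 guarantee you mentioned.

#include <infiniband/verbs.h>

/* Allocate a TD, wrap it in a parent domain, and create the per-thread QP
 * on that parent domain. "ctx", "pd", and "cq" are assumed to exist. */
static struct ibv_qp *create_qp_on_td(struct ibv_context *ctx,
                                      struct ibv_pd *pd,
                                      struct ibv_cq *cq)
{
    struct ibv_td_init_attr td_attr = { .comp_mask = 0 };
    struct ibv_td *td = ibv_alloc_td(ctx, &td_attr);
    if (!td)
        return NULL;

    struct ibv_parent_domain_init_attr pd_attr = {
        .pd = pd,
        .td = td,
        .comp_mask = 0,
    };
    struct ibv_pd *parent_pd = ibv_alloc_parent_domain(ctx, &pd_attr);
    if (!parent_pd)
        return NULL;

    struct ibv_qp_init_attr qp_attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = {
            .max_send_wr  = 128,
            .max_recv_wr  = 1,
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
        .qp_type = IBV_QPT_RC,
    };
    /* Creating the QP on the parent domain associates it with the TD. */
    return ibv_create_qp(parent_pd, &qp_attr);
}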