On Sun, May 6, 2018 at 7:47 AM, Yishai Hadas <yishaih@xxxxxxxxxxxxxxxxxx> wrote:
> On 5/4/2018 12:46 AM, Rohit Zambre wrote:
>>
>> On Thu, May 3, 2018 at 3:15 PM, Alex Rosenbaum <rosenbaumalex@xxxxxxxxx> wrote:
>>>
>>> On Thu, May 3, 2018 at 6:19 PM, Rohit Zambre <rzambre@xxxxxxx> wrote:
>>>>
>>>> An independent communication path is one that shares no hardware
>>>> resources with other communication paths. From a Verbs perspective, an
>>>> independent path is the one obtained by the first QP in a context. The
>>>> next QPs of the context may or may not share hardware resources amongst
>>>> themselves; the mapping of the resources to the QPs is provider-specific.
>>>> Sharing resources can hurt throughput in certain cases. When only one
>>>> thread uses the independent path, we term it an uncontended independent
>>>> path.
>>>>
>>>> Today, the user has no way to request an independent path for an
>>>> arbitrary QP within a context. To create multiple independent paths, the
>>>> Verbs user must create multiple contexts with 1 QP per context. However,
>>>> this translates to significant hardware-resource wastage: 89% in the
>>>> case of the ConnectX-4 mlx5 device.
>>>>
>>>> This RFC patch allows the user to request uncontended independent
>>>> communication paths in Verbs through an "independent" flag during Thread
>>>> Domain (TD) creation. The patch also provides a first-draft
>>>> implementation of uncontended independent paths in the mlx5 provider.
>>>>
>>>> In mlx5, every even-odd pair of TDs shares the same UAR page, which is
>>>> not the case when the user creates multiple contexts with one TD per
>>>> context. When the user requests an independent TD, the driver will
>>>> dynamically allocate a new UAR page and map bfreg_0 of that UAR to the
>>>> TD. bfreg_1 of the UAR belonging to an independent TD is never used and
>>>> is essentially wasted. Hence, there must be a maximum number of
>>>> independent paths allowed within a context since the hardware resources
>>>> are limited. This would be half of the maximum number of dynamic UARs
>>>> allowed per context.
>>>
>>> I'm not sure I follow what you're trying to achieve here on the mlx5 HW
>>> level.
>>> Are you assuming that two threads with separate 'indep-comm-paths' using
>>> separate bfregs on the same UAR page cause some contention and a
>>> performance hit in the mlx5 HW?
>>> We should first prove that's true, and then design a solution to solve it.
>>> Do you have benchmark results of any kind?
>>
>> Yes, there is a ~20% drop in message rates when there are concurrent
>> BlueFlame writes to separate bfregs on the same UAR page.
>
> Can you please share your test code to help us make sure that you are
> really referring to the above case with the below analysis?

I have attached my benchmark code. The critical path of interest is lines
554-598. In the README, I have included an example of how to run the
benchmark. Let me know if you have any questions or concerns regarding the
benchmark code.

>> The graph attached reports message rates using rdma-core for 2-byte
>> RDMA-writes using 16 threads. Each thread is driving its own QP. Each
>> thread has its own CQ. Thread Domains are not used in this benchmark.
>
> Can you try to use TDs in your test and see if you get the same results
> before your initial patch? This mode cleanly guarantees a 1<->1 mapping of
> a UAR bfreg to a QP.

I will most likely have the numbers with TDs in the second half of the week.
I will report them here then.
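In case it helps, this is roughly the per-thread flow I plan to use for the
TD runs. It is just a minimal sketch against the stock rdma-core
TD/parent-domain API; the rest of the setup and the error handling from the
attached benchmark are omitted:

#include <infiniband/verbs.h>

/* Per-thread setup: bind a QP to its own Thread Domain through a parent
 * domain, so that mlx5 gives the TD a dedicated bfreg and can drop the
 * send-path locks. ctx/pd/cq come from the usual benchmark setup. */
static struct ibv_qp *create_td_qp(struct ibv_context *ctx,
                                   struct ibv_pd *pd, struct ibv_cq *cq)
{
        struct ibv_td_init_attr td_attr = { .comp_mask = 0 };
        struct ibv_td *td = ibv_alloc_td(ctx, &td_attr);
        if (!td)
                return NULL;

        struct ibv_parent_domain_init_attr pd_attr = {
                .pd = pd,
                .td = td,
                .comp_mask = 0,
        };
        struct ibv_pd *parent_pd = ibv_alloc_parent_domain(ctx, &pd_attr);
        if (!parent_pd)
                return NULL;

        struct ibv_qp_init_attr qp_attr = {
                .send_cq = cq,
                .recv_cq = cq,
                .cap = {
                        .max_send_wr  = 128,
                        .max_recv_wr  = 128,
                        .max_send_sge = 1,
                        .max_recv_sge = 1,
                },
                .qp_type = IBV_QPT_RC,
        };
        /* Creating the QP on the parent domain (instead of pd) ties it
         * to the TD and hence to the TD's bfreg. */
        return ibv_create_qp(parent_pd, &qp_attr);
}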
>> The x-axis is the ratio of #QPs:#CTXs. For example, 2-way CTX-sharing
>> means there are 8 CTXs with 2 QPs each. "wBF" means "with BlueFlame"
>> and "woBF" means without (by setting MLX5_SHUT_UP_BF=1). "wPostlist2"
>> means the size of the linked list of WQEs is 2.
>
> Looking at your graph, the best results are wPostlist2-wBF, correct? But
> in that case we don't expect BF at all, just DB, as you wrote below. Can
> you please clarify the test and the results that are represented here?

Yes, correct. With postlist, rdma-core doesn't use BF, just DB. I included
the various lines to show the differences in behavior. The semantics of
Verbs users may or may not allow the use of features such as postlist.

>> "woPostlist" means each thread is posting only 1 WQE per ibv_post_send.
>> These numbers are on a ConnectX-4 mlx5 device (on the Gomez machine of
>> JLSE). The numbers are the same on the ConnectX-4 device on the Thor
>> cluster of the HPC Advisory Council. The behavior with MOFED is the same
>> with slight differences in absolute numbers; the drop is ~15%.
>>
>> The first drop in the green line is due to concurrent BlueFlame writes
>> on the same UAR page. The second drop is due to bfreg lock contention
>> between the 5th and the 16th QP. With a postlist size greater than 1,
>> rdma-core does only 64-bit DoorBells. Concurrent DoorBells don't hurt.
>> Concurrent BlueFlame writes do. What exactly is causing this, I am not
>> sure. But from some more experimenting, I think the answer lies in how
>> the NIC finds out whether to fetch the WQE from the BlueFlame buffer or
>> DMA-read it from memory. I wasn't able to find a "bit" that was set
>> during WQE preparation that tells the NIC where to read from. But it
>> could be something else entirely.
>>
>> We are addressing the green line with this patch.
>>
>>> When you create two separate ibv_contexts, you separate a lot more than
>>> just the UAR pages on which the bfregs are mapped. The entire software
>>> locking scheme is separated.
>>
>> Right. In the description, I wanted to emphasize the independent-path
>> aspect of different contexts since that is most important to the MPI
>> library. The locking can be controlled through Thread Domains.
>>
>>> The ibv_td object allows the user to separate resources so that locks
>>> can be managed in a smarter way in the provider lib data fast path.
>>> For that we allocate a bfreg for each ibv_td obj. Using a dedicated
>>> bfreg allows lower-latency sends, as the doorbell does not need a lock
>>> to write the even/odd entries.
>>> At the time we did not extend the work to cover additional locks in
>>> mlx5, but it seems your series is targeting something else.
>>
>> If you are referring to [1], then that patch targets just disabling the
>> QP lock when a Thread Domain is specified. To create an independent
>> software path, the MPI library will use the Thread Domain.
>>
>> [1] https://patchwork.kernel.org/patch/10367419/
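To make the intended usage by the MPI library a bit more concrete, below is
a rough sketch of how the library side could request an independent TD with
the proposed flag. The comp_mask bit and flag name here are only
placeholders for the idea in the RFC, not the exact interface of the patch:

#include <infiniband/verbs.h>

/* Illustrative only: IBV_TD_INIT_ATTR_FLAGS, the flags field, and
 * IBV_TD_INDEPENDENT are placeholders for what the RFC proposes. The
 * idea is that the MPI library asks, at TD-creation time, for a TD that
 * does not share a UAR page (and hence a BlueFlame register) with any
 * other TD of the same context. */
static struct ibv_td *alloc_independent_td(struct ibv_context *ctx)
{
        struct ibv_td_init_attr td_attr = {
                .comp_mask = IBV_TD_INIT_ATTR_FLAGS,  /* placeholder */
                .flags     = IBV_TD_INDEPENDENT,      /* placeholder */
        };
        struct ibv_td *td = ibv_alloc_td(ctx, &td_attr);

        if (!td) {
                /* For example, the per-context cap on dynamic UARs (and
                 * thus on independent TDs) has been hit; fall back to a
                 * regular TD that may share a UAR page. */
                struct ibv_td_init_attr fallback = { .comp_mask = 0 };
                td = ibv_alloc_td(ctx, &fallback);
        }
        return td;
}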
<<attachment: bench.zip>>