On Sat, Sep 10, 2022 4:39 AM Bob Pearson wrote:
> On 9/6/22 21:43, Daisuke Matsuda wrote:
> > In order to implement On-Demand Paging on the rxe driver, the three tasklets
> > (requester, responder, and completer) must be allowed to sleep so that they
> > can trigger page faults when the pages being accessed are not present.
> >
> > This patch replaces the tasklets with workqueues, but still allows direct
> > calls of works from softirq context if it is obvious that MRs are not going
> > to be accessed and there is no work being processed in the workqueue.
> >
> > As counterparts to tasklet_disable() and tasklet_enable() are missing for
> > workqueues, an atomic value is introduced to keep works suspended while a qp
> > reset is in progress.
> >
> > As a reference, the performance change was measured using the ib_send_bw and
> > ib_send_lat commands over an RC connection. Both the client and the server
> > were placed on the same KVM host. The option "-n 100000" was given to the
> > respective commands to run 100000 iterations.
> >
> > Before applying this patch:
> > [ib_send_bw]
> > ---------------------------------------------------------------------------------------
> >  #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]    MsgRate[Mpps]
> >  65536      100000         0.00               203.13                0.003250
> > ---------------------------------------------------------------------------------------
> > [ib_send_lat]
> > ---------------------------------------------------------------------------------------
> >  #bytes  #iterations  t_min[usec]  t_max[usec]  t_typical[usec]  t_avg[usec]  t_stdev[usec]  99% percentile[usec]  99.9% percentile[usec]
> >  2       100000       30.91        1519.05      34.23            36.06        5.15           48.49                 82.37
> > ---------------------------------------------------------------------------------------
> >
> > After applying this patch:
> > [ib_send_bw]
> > ---------------------------------------------------------------------------------------
> >  #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]    MsgRate[Mpps]
> >  65536      100000         0.00               240.88                0.003854
> > ---------------------------------------------------------------------------------------
> > [ib_send_lat]
> > ---------------------------------------------------------------------------------------
> >  #bytes  #iterations  t_min[usec]  t_max[usec]  t_typical[usec]  t_avg[usec]  t_stdev[usec]  99% percentile[usec]  99.9% percentile[usec]
> >  2       100000       40.88        2994.82      47.80            50.25        13.70          76.42                 185.04
> > ---------------------------------------------------------------------------------------
> >
> > The test was conducted 3 times for each kernel, and the results with the median
> > "BW average" and "t_typical" are shown above. They show that the conversion
> > improves the bandwidth while causing higher latency.
> >
> > Signed-off-by: Daisuke Matsuda <matsuda-daisuke@xxxxxxxxxxx>
> > ---
> >  drivers/infiniband/sw/rxe/Makefile    |   2 +-
> >  drivers/infiniband/sw/rxe/rxe_comp.c  |  42 ++++---
> >  drivers/infiniband/sw/rxe/rxe_loc.h   |   2 +-
> >  drivers/infiniband/sw/rxe/rxe_net.c   |   4 +-
> >  drivers/infiniband/sw/rxe/rxe_param.h |   2 +-
> >  drivers/infiniband/sw/rxe/rxe_qp.c    |  68 +++++------
> >  drivers/infiniband/sw/rxe/rxe_recv.c  |   2 +-
> >  drivers/infiniband/sw/rxe/rxe_req.c   |  14 +--
> >  drivers/infiniband/sw/rxe/rxe_resp.c  |  16 +--
> >  drivers/infiniband/sw/rxe/rxe_task.c  | 152 ------------------------
> >  drivers/infiniband/sw/rxe/rxe_task.h  |  69 -----------
> >  drivers/infiniband/sw/rxe/rxe_verbs.c |   8 +-
> >  drivers/infiniband/sw/rxe/rxe_verbs.h |   8 +-
> >  drivers/infiniband/sw/rxe/rxe_wq.c    | 161 ++++++++++++++++++++++++++
> >  drivers/infiniband/sw/rxe/rxe_wq.h    |  71 ++++++++++++
> >  15 files changed, 322 insertions(+), 299 deletions(-)
> >  delete mode 100644 drivers/infiniband/sw/rxe/rxe_task.c
> >  delete mode 100644 drivers/infiniband/sw/rxe/rxe_task.h
> >  create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.c
> >  create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.h

<...>

> Daisuke,
>
> We (HPE) have an internal patch set that also allows using work queues instead of tasklets.
> There is a certain amount of work to allow control over CPU assignment from the ULP or
> application level. This is critical for high performance in multithreaded IO applications.
> It would be in both of our interests if we could find a common implementation that works
> for everyone.

I agree with you, and I am interested in your work. Would you post the patch set?
It may be possible to rebase my work on your implementation. I simply replaced the
tasklets mechanically and modified the conditions for dispatching works to the
responder and completer workqueues, so the changes are not very complicated.

I will be away from 9/15 to 9/20. After that, I can take the time to look into it.

By the way, I would also like you to join the discussion about whether to preserve
the tasklets or not.
Cf. https://lore.kernel.org/all/TYCPR01MB8455D739FC9FB034E3485C87E5449@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

> Perhaps we could arrange for a call to discuss?

This is an important change that may affect all rxe users, so I think the discussion
should be open to anybody, at least at this stage.

Thanks,
Daisuke
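For readers following the thread, here is a minimal sketch of the suspend/resume
pattern the cover letter describes, where an atomic counter stands in for the
missing tasklet_disable()/tasklet_enable() counterparts while the work handler is
allowed to sleep. All names below (rxe_work, rxe_do_work(), rxe_work_suspend(),
rxe_work_resume(), rxe_run_work(), rxe_init_work()) are hypothetical illustrations
under assumed semantics, not the identifiers used in the actual rxe_wq.c/rxe_wq.h
files of the patch.

/*
 * Illustrative sketch only, not the code from this patch.
 */
#include <linux/workqueue.h>
#include <linux/atomic.h>

struct rxe_work {
	struct workqueue_struct	*wq;		/* replaces the per-qp tasklet */
	struct work_struct	work;
	atomic_t		suspend;	/* > 0 while a qp reset is in progress */
	void			(*func)(void *arg);
	void			*arg;
};

static void rxe_do_work(struct work_struct *w)
{
	struct rxe_work *rw = container_of(w, struct rxe_work, work);

	/* Skip execution while suspended; otherwise run the handler,
	 * which is now allowed to sleep, e.g. to fault in ODP pages. */
	if (!atomic_read(&rw->suspend))
		rw->func(rw->arg);
}

static void rxe_init_work(struct rxe_work *rw, struct workqueue_struct *wq,
			  void (*func)(void *), void *arg)
{
	rw->wq = wq;
	rw->func = func;
	rw->arg = arg;
	atomic_set(&rw->suspend, 0);
	INIT_WORK(&rw->work, rxe_do_work);
}

/* Counterpart of tasklet_disable(): block new executions and wait for
 * any work already running on the workqueue to finish. */
static void rxe_work_suspend(struct rxe_work *rw)
{
	atomic_inc(&rw->suspend);
	flush_workqueue(rw->wq);
}

/* Counterpart of tasklet_enable(). */
static void rxe_work_resume(struct rxe_work *rw)
{
	atomic_dec(&rw->suspend);
}

/* Counterpart of tasklet_schedule(). */
static void rxe_run_work(struct rxe_work *rw)
{
	if (!atomic_read(&rw->suspend))
		queue_work(rw->wq, &rw->work);
}

With a scheme along these lines, the qp reset path would bracket the reset with
rxe_work_suspend()/rxe_work_resume() at the points where the tasklet-based code
used tasklet_disable()/tasklet_enable().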