On Fri, Oct 18, 2024 4:07 PM Zhu Yanjun wrote: > 在 2024/10/9 3:58, Daisuke Matsuda 写道: > > This patch series implements the On-Demand Paging feature on SoftRoCE(rxe) > > driver, which has been available only in mlx5 driver[1] so far. > > > > This series has been blocked because of the hang issue of srp 002 test[2], > > which was believed to be caused after applying the commit 9b4b7c1f9f54 > > ("RDMA/rxe: Add workqueue support for rxe tasks"). My patches are dependent > > on the commit because the ODP feature requires sleeping in kernel space, > > and it is impossible with the former tasklet implementation. > > > > According to the original reporter[3], the hang issue is already gone in > > v6.10. Additionally, tasklet is marked deprecated[4]. I think the rxe > > driver is ready to accept this series since there is no longer any reason > > to consider reverting back to the old tasklet. > > > > I omitted some contents like the motive behind this series from the cover- > > letter. Please see the cover letter of v3 for more details[5]. > > > > [Overview] > > When applications register a memory region(MR), RDMA drivers normally pin > > pages in the MR so that physical addresses are never changed during RDMA > > communication. This requires the MR to fit in physical memory and > > inevitably leads to memory pressure. On the other hand, On-Demand Paging > > (ODP) allows applications to register MRs without pinning pages. They are > > paged-in when the driver requires and paged-out when the OS reclaims. As a > > result, it is possible to register a large MR that does not fit in physical > > memory without taking up so much physical memory. > > > > [How does ODP work?] > > "struct ib_umem_odp" is used to manage pages. It is created for each > > ODP-enabled MR on its registration. This struct holds a pair of arrays > > (dma_list/pfn_list) that serve as a driver page table. DMA addresses and > > PFNs are stored in the driver page table. They are updated on page-in and > > page-out, both of which use the common interfaces in the ib_uverbs layer. > > > > Page-in can occur when requester, responder or completer access an MR in > > order to process RDMA operations. If they find that the pages being > > accessed are not present on physical memory or requisite permissions are > > not set on the pages, they provoke page fault to make the pages present > > with proper permissions and at the same time update the driver page table. > > After confirming the presence of the pages, they execute memory access such > > as read, write or atomic operations. > > > > Page-out is triggered by page reclaim or filesystem events (e.g. metadata > > update of a file that is being used as an MR). When creating an ODP-enabled > > MR, the driver registers an MMU notifier callback. When the kernel issues a > > page invalidation notification, the callback is provoked to unmap DMA > > addresses and update the driver page table. After that, the kernel releases > > the pages. > > > > [Supported operations] > > All traditional operations are supported on RC connection. The new Atomic > > write[6] and RDMA Flush[7] operations are not included in this patchset. I > > will post them later after this patchset is merged. On UD connection, Send, > > Recv, and SRQ-Recv are supported. > > > > [How to test ODP?] > > There are only a few resources available for testing. pyverbs testcases in > > rdma-core and perftest[8] are recommendable ones. Other than them, the > > ibv_rc_pingpong command can also be used for testing. Note that you may > > have to build perftest from upstream because old versions do not handle ODP > > capabilities correctly. > > Thanks a lot. I have tested these patches with perftest. Because ODP (On > Demand Paging) is a feature, can you also add some testcases into rdma > core? So we can use rdma-core to make tests with this feature of rxe. I added Read/Write/Atomics tests two years ago. Cf. https://github.com/linux-rdma/rdma-core/pull/1229 Each of ODP testcases causes page invalidation so that RDMA traffic access triggers ODP page-in flow. Currently, 7 testcases below can pass on rxe ODP v8 implementation. test_odp_rc_atomic_cmp_and_swp test_odp_rc_atomic_fetch_and_add test_odp_rc_mixed_mr test_odp_rc_rdma_read test_odp_rc_rdma_write test_odp_rc_traffic test_odp_ud_traffic The rest 11 tests are just skipped because of lack of capabilities. Please let me know if you have any suggestions for improvement. Thanks, Daisuke Matsuda > > That is, add some testcases in run_tests.py, so use run_tests.py to > verify this (ODP) feature on rxe. > > Thanks, > Zhu Yanjun > > > > > The latest ODP tree is available from github: > > https://github.com/ddmatsu/linux/tree/odp_v8 > > > > [Future work] > > My next work is to enable the new Atomic write[6] and RDMA Flush[7] > > operations with ODP. After that, I am going to implement the prefetch > > feature. It allows applications to trigger page fault using > > ibv_advise_mr(3) to optimize performance. Some existing software like > > librpma[9] use this feature. Additionally, I think we can also add the > > implicit ODP feature in the future. > > > > [1] Understanding On Demand Paging (ODP) > > https://enterprise-support.nvidia.com/s/article/understanding-on-demand-paging--odp-x > > > > [2] [bug report] blktests srp/002 hang > > https://lore.kernel.org/linux-rdma/dsg6rd66tyiei32zaxs6ddv5ebefr5vtxjwz6d2ewqrcwisogl@ge7jzan7dg5u/T/ > > > > [3] blktests failures with v6.10-rc1 kernel > > https://lore.kernel.org/linux-block/wnucs5oboi4flje5yvtea7puvn6zzztcnlrfz3lpzlwgblrxgw@7wvqdzioejgl/ > > > > [4] [00/15] ethernet: Convert from tasklet to BH workqueue > > https://patchwork.kernel.org/project/linux-rdma/cover/20240621050525.3720069-1-allen.lkml@xxxxxxxxx/ > > > > [5] [PATCH for-next v3 0/7] On-Demand Paging on SoftRoCE > > https://lore.kernel.org/lkml/cover.1671772917.git.matsuda-daisuke@xxxxxxxxxxx/ > > > > [6] [PATCH v7 0/8] RDMA/rxe: Add atomic write operation > > https://lore.kernel.org/linux-rdma/1669905432-14-1-git-send-email-yangx.jy@xxxxxxxxxxx/ > > > > [7] [for-next PATCH 00/10] RDMA/rxe: Add RDMA FLUSH operation > > https://lore.kernel.org/lkml/20221206130201.30986-1-lizhijian@xxxxxxxxxxx/ > > > > [8] linux-rdma/perftest: Infiniband Verbs Performance Tests > > https://github.com/linux-rdma/perftest > > > > [9] librpma: Remote Persistent Memory Access Library > > https://github.com/pmem/rpma > > > > v7->v8: > > 1) Dropped the first patch because the same change was made by Bob Pearson. > > cf. https://github.com/torvalds/linux/commit/23bc06af547f2ca3b7d345e09fd8d04575406274 > > 2) Rebased to 6.12.1-rc2 > > > > v6->v7: > > 1) Rebased to 6.6.0 > > 2) Disabled using hugepages with ODP > > 3) Addressed comments on v6 from Jason and Zhu > > cf. https://lore.kernel.org/lkml/cover.1694153251.git.matsuda-daisuke@xxxxxxxxxxx/ > > > > v5->v6: > > Fixed the implementation according to Jason's suggestions > > cf. https://lore.kernel.org/all/ZIdFXfDu4IMKE+BQ@xxxxxxxxxx/ > > cf. https://lore.kernel.org/all/ZIdGU709e1h5h4JJ@xxxxxxxxxx/ > > > > v4->v5: > > 1) Rebased to 6.4.0-rc2+ > > 2) Changed to schedule all works on responder and completer to workqueue > > > > v3->v4: > > 1) Re-designed functions that access MRs to use the MR xarray. > > 2) Rebased onto the latest jgg-for-next tree. > > > > v2->v3: > > 1) Removed a patch that changes the common ib_uverbs layer. > > 2) Re-implemented patches for conversion to workqueue. > > 3) Fixed compile errors (happened when CONFIG_INFINIBAND_ON_DEMAND_PAGING=n). > > 4) Fixed some functions that returned incorrect errors. > > 5) Temporarily disabled ODP for RDMA Flush and Atomic Write. > > > > v1->v2: > > 1) Fixed a crash issue reported by Haris Iqbal. > > 2) Tried to make lock patters clearer as pointed out by Romanovsky. > > 3) Minor clean ups and fixes. > > > > Daisuke Matsuda (6): > > RDMA/rxe: Make MR functions accessible from other rxe source code > > RDMA/rxe: Move resp_states definition to rxe_verbs.h > > RDMA/rxe: Add page invalidation support > > RDMA/rxe: Allow registering MRs for On-Demand Paging > > RDMA/rxe: Add support for Send/Recv/Write/Read with ODP > > RDMA/rxe: Add support for the traditional Atomic operations with ODP > > > > drivers/infiniband/sw/rxe/Makefile | 2 + > > drivers/infiniband/sw/rxe/rxe.c | 18 ++ > > drivers/infiniband/sw/rxe/rxe.h | 37 ---- > > drivers/infiniband/sw/rxe/rxe_loc.h | 39 ++++ > > drivers/infiniband/sw/rxe/rxe_mr.c | 34 +++- > > drivers/infiniband/sw/rxe/rxe_odp.c | 282 ++++++++++++++++++++++++++ > > drivers/infiniband/sw/rxe/rxe_resp.c | 18 +- > > drivers/infiniband/sw/rxe/rxe_verbs.c | 5 +- > > drivers/infiniband/sw/rxe/rxe_verbs.h | 37 ++++ > > 9 files changed, 419 insertions(+), 53 deletions(-) > > create mode 100644 drivers/infiniband/sw/rxe/rxe_odp.c > >