Re: [PATCH for-next v8 0/6] On-Demand Paging on SoftRoCE


 



On 2024/10/28 8:59, Daisuke Matsuda (Fujitsu) wrote:
On Fri, Oct 18, 2024 4:07 PM Zhu Yanjun wrote:
On 2024/10/9 3:58, Daisuke Matsuda wrote:
This patch series implements the On-Demand Paging feature in the SoftRoCE
(rxe) driver. So far the feature has been available only in the mlx5
driver[1].

This series has been blocked because of the hang issue in the srp/002
test[2], which was believed to have been introduced by commit 9b4b7c1f9f54
("RDMA/rxe: Add workqueue support for rxe tasks"). My patches depend on
that commit because the ODP feature requires sleeping in kernel space,
which is impossible with the former tasklet implementation.

According to the original reporter[3], the hang issue is already gone in
v6.10. Additionally, tasklets are marked as deprecated[4]. I think the rxe
driver is ready to accept this series since there is no longer any reason
to consider reverting to the old tasklet implementation.

I omitted some content, such as the motivation behind this series, from
this cover letter. Please see the cover letter of v3 for more details[5].

[Overview]
When applications register a memory region (MR), RDMA drivers normally pin
the pages in the MR so that their physical addresses never change during
RDMA communication. This requires the MR to fit in physical memory and
inevitably leads to memory pressure. On-Demand Paging (ODP), on the other
hand, allows applications to register MRs without pinning pages. Pages are
paged in when the driver needs them and paged out when the OS reclaims
them. As a result, it is possible to register a large MR that does not fit
in physical memory without consuming much physical memory.

[How does ODP work?]
"struct ib_umem_odp" is used to manage pages. It is created for each
ODP-enabled MR on its registration. This struct holds a pair of arrays
(dma_list/pfn_list) that serve as a driver page table. DMA addresses and
PFNs are stored in the driver page table. They are updated on page-in and
page-out, both of which use the common interfaces in the ib_uverbs layer.

Page-in can occur when the requester, responder or completer accesses an MR
in order to process RDMA operations. If it finds that the pages being
accessed are not present in physical memory, or that the requisite
permissions are not set on the pages, it triggers a page fault to make the
pages present with proper permissions and, at the same time, updates the
driver page table. After confirming the presence of the pages, it performs
the memory access, such as a read, write or atomic operation.

Page-out is triggered by page reclaim or filesystem events (e.g. a metadata
update of a file that is being used as an MR). When creating an ODP-enabled
MR, the driver registers an MMU notifier callback. When the kernel issues a
page invalidation notification, the callback is invoked to unmap the DMA
addresses and update the driver page table. After that, the kernel releases
the pages.
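
To make the page-in and page-out flow above more concrete, here is a
minimal userspace sketch in Python. It is only a toy model and not the rxe
or ib_umem_odp code: the class ToyOdpMr, its fields, the fake PFN allocator
and the 4096-byte page size are all assumptions made for illustration. One
dict stands in for the driver page table, access() plays the role of the
requester/responder/completer, and invalidate() plays the role of the MMU
notifier callback.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumed page size for this model


@dataclass
class ToyOdpMr:
    """Toy userspace model of an ODP-enabled MR (hypothetical, not rxe code)."""
    start: int                      # first virtual address covered by the MR
    length: int                     # MR length in bytes
    table: dict = field(default_factory=dict)  # page index -> (pfn, writable)
    faults: int = 0                 # number of simulated page faults
    next_pfn: int = 0x1000          # state of the fake PFN allocator

    def _pages(self, addr, length):
        # Page indexes (relative to the MR start) touched by [addr, addr+length).
        first = (addr - self.start) // PAGE_SIZE
        last = (addr + length - 1 - self.start) // PAGE_SIZE
        return range(first, last + 1)

    def access(self, addr, length, write=False):
        """Access path: fault in missing/read-only pages, then return the PFNs."""
        if length <= 0 or addr < self.start or addr + length > self.start + self.length:
            raise ValueError("access outside the MR")
        pages = self._pages(addr, length)
        bad = [i for i in pages
               if i not in self.table or (write and not self.table[i][1])]
        if bad:
            self.faults += 1        # one simulated fault covers the whole access
            for i in bad:
                self._page_in(i, write)
        return [self.table[i][0] for i in pages]

    def _page_in(self, idx, write):
        if idx in self.table:
            pfn = self.table[idx][0]    # page present: only upgrade permission
        else:
            pfn = self.next_pfn         # page absent: "allocate" a new frame
            self.next_pfn += 1
        self.table[idx] = (pfn, write)

    def invalidate(self, addr, length):
        """Notifier-like callback: drop table entries overlapping the range."""
        lo = max(addr, self.start)
        hi = min(addr + length, self.start + self.length)
        if lo >= hi:
            return 0
        dropped = 0
        for i in self._pages(lo, hi - lo):
            if self.table.pop(i, None) is not None:
                dropped += 1
        return dropped
```

In this model, a first access to a page counts one fault, a repeated read
counts none, a write to a page that was faulted in read-only counts another
fault, and an access after invalidate() faults again because the entry is
gone from the table.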

[Supported operations]
All traditional operations are supported on RC connections. The new Atomic
write[6] and RDMA Flush[7] operations are not included in this patchset. I
will post them later after this patchset is merged. On UD connections,
Send, Recv, and SRQ-Recv are supported.

[How to test ODP?]
There are only a few resources available for testing. The pyverbs testcases
in rdma-core and perftest[8] are the recommended ones. Besides them, the
ibv_rc_pingpong command can also be used for testing. Note that you may
have to build perftest from upstream because old versions do not handle ODP
capabilities correctly.
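
As a rough sketch of the ibv_rc_pingpong route, the Python helper below
only builds the command lines for a server and a client. It assumes the
ibv_rc_pingpong options -d (device), -g (GID index), -s (message size),
-n (iterations) and -o (use on-demand paging). The helper name
pingpong_cmd, the device name rxe0, the GID index 1 and the address
192.0.2.1 are placeholders that are not taken from this series, so please
adjust them to your setup and check the options against your installed
version.

```python
import shlex


def pingpong_cmd(dev, gid_idx, server=None, odp=True, size=4096, iters=1000):
    """Build an ibv_rc_pingpong argv list (hypothetical helper).

    With server=None the command starts the listening side; otherwise the
    server address is appended and the command starts the client side.
    """
    argv = ["ibv_rc_pingpong", "-d", dev, "-g", str(gid_idx),
            "-s", str(size), "-n", str(iters)]
    if odp:
        argv.append("-o")       # register the buffer with ODP
    if server is not None:
        argv.append(server)     # client side: connect to this host
    return argv


if __name__ == "__main__":
    # Placeholder device, GID index and address; replace with real values.
    print(shlex.join(pingpong_cmd("rxe0", 1)))
    print(shlex.join(pingpong_cmd("rxe0", 1, server="192.0.2.1")))
```

Running the same pair of commands once with odp=True and once with
odp=False gives a simple comparison between the ODP path and the pinned
path on the same rxe device.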

Thanks a lot. I have tested these patches with perftest. Since ODP
(On-Demand Paging) is a new feature, can you also add some testcases to
rdma-core, so that we can use rdma-core to test this feature on rxe?

I added Read/Write/Atomics tests two years ago.
Cf. https://github.com/linux-rdma/rdma-core/pull/1229

Each ODP testcase causes a page invalidation so that the subsequent RDMA
access triggers the ODP page-in flow.

Currently, the 7 testcases below pass on the rxe ODP v8 implementation.
   test_odp_rc_atomic_cmp_and_swp
   test_odp_rc_atomic_fetch_and_add
   test_odp_rc_mixed_mr
   test_odp_rc_rdma_read
   test_odp_rc_rdma_write
   test_odp_rc_traffic
   test_odp_ud_traffic
The remaining 11 tests are skipped because of a lack of capabilities.

Thanks. When I run rdma-core, the above tests also pass in my test environment.
I am fine with this.

Zhu Yanjun


Please let me know if you have any suggestions for improvement.

Thanks,
Daisuke Matsuda


That is, add some testcases to run_tests.py, so that run_tests.py can be
used to verify this (ODP) feature on rxe.

Thanks,
Zhu Yanjun


The latest ODP tree is available from github:
https://github.com/ddmatsu/linux/tree/odp_v8

[Future work]
My next task is to enable the new Atomic write[6] and RDMA Flush[7]
operations with ODP. After that, I am going to implement the prefetch
feature. It allows applications to trigger page faults using
ibv_advise_mr(3) to optimize performance. Some existing software like
librpma[9] uses this feature. Additionally, I think we can also add the
implicit ODP feature in the future.

[1] Understanding On Demand Paging (ODP)
https://enterprise-support.nvidia.com/s/article/understanding-on-demand-paging--odp-x

[2] [bug report] blktests srp/002 hang
https://lore.kernel.org/linux-rdma/dsg6rd66tyiei32zaxs6ddv5ebefr5vtxjwz6d2ewqrcwisogl@ge7jzan7dg5u/T/

[3] blktests failures with v6.10-rc1 kernel
https://lore.kernel.org/linux-block/wnucs5oboi4flje5yvtea7puvn6zzztcnlrfz3lpzlwgblrxgw@7wvqdzioejgl/

[4] [00/15] ethernet: Convert from tasklet to BH workqueue
https://patchwork.kernel.org/project/linux-rdma/cover/20240621050525.3720069-1-allen.lkml@xxxxxxxxx/

[5] [PATCH for-next v3 0/7] On-Demand Paging on SoftRoCE
https://lore.kernel.org/lkml/cover.1671772917.git.matsuda-daisuke@xxxxxxxxxxx/

[6] [PATCH v7 0/8] RDMA/rxe: Add atomic write operation
https://lore.kernel.org/linux-rdma/1669905432-14-1-git-send-email-yangx.jy@xxxxxxxxxxx/

[7] [for-next PATCH 00/10] RDMA/rxe: Add RDMA FLUSH operation
https://lore.kernel.org/lkml/20221206130201.30986-1-lizhijian@xxxxxxxxxxx/

[8] linux-rdma/perftest: Infiniband Verbs Performance Tests
https://github.com/linux-rdma/perftest

[9] librpma: Remote Persistent Memory Access Library
https://github.com/pmem/rpma

v7->v8:
   1) Dropped the first patch because the same change was made by Bob Pearson.
   cf. https://github.com/torvalds/linux/commit/23bc06af547f2ca3b7d345e09fd8d04575406274
   2) Rebased to 6.12.1-rc2

v6->v7:
   1) Rebased to 6.6.0
   2) Disabled using hugepages with ODP
   3) Addressed comments on v6 from Jason and Zhu
     cf. https://lore.kernel.org/lkml/cover.1694153251.git.matsuda-daisuke@xxxxxxxxxxx/

v5->v6:
   Fixed the implementation according to Jason's suggestions
     cf. https://lore.kernel.org/all/ZIdFXfDu4IMKE+BQ@xxxxxxxxxx/
     cf. https://lore.kernel.org/all/ZIdGU709e1h5h4JJ@xxxxxxxxxx/

v4->v5:
   1) Rebased to 6.4.0-rc2+
   2) Changed to schedule all works on responder and completer to workqueue

v3->v4:
   1) Re-designed functions that access MRs to use the MR xarray.
   2) Rebased onto the latest jgg-for-next tree.

v2->v3:
   1) Removed a patch that changes the common ib_uverbs layer.
   2) Re-implemented patches for conversion to workqueue.
   3) Fixed compile errors (happened when CONFIG_INFINIBAND_ON_DEMAND_PAGING=n).
   4) Fixed some functions that returned incorrect errors.
   5) Temporarily disabled ODP for RDMA Flush and Atomic Write.

v1->v2:
   1) Fixed a crash issue reported by Haris Iqbal.
   2) Tried to make lock patterns clearer as pointed out by Romanovsky.
   3) Minor clean ups and fixes.

Daisuke Matsuda (6):
    RDMA/rxe: Make MR functions accessible from other rxe source code
    RDMA/rxe: Move resp_states definition to rxe_verbs.h
    RDMA/rxe: Add page invalidation support
    RDMA/rxe: Allow registering MRs for On-Demand Paging
    RDMA/rxe: Add support for Send/Recv/Write/Read with ODP
    RDMA/rxe: Add support for the traditional Atomic operations with ODP

   drivers/infiniband/sw/rxe/Makefile    |   2 +
   drivers/infiniband/sw/rxe/rxe.c       |  18 ++
   drivers/infiniband/sw/rxe/rxe.h       |  37 ----
   drivers/infiniband/sw/rxe/rxe_loc.h   |  39 ++++
   drivers/infiniband/sw/rxe/rxe_mr.c    |  34 +++-
   drivers/infiniband/sw/rxe/rxe_odp.c   | 282 ++++++++++++++++++++++++++
   drivers/infiniband/sw/rxe/rxe_resp.c  |  18 +-
   drivers/infiniband/sw/rxe/rxe_verbs.c |   5 +-
   drivers/infiniband/sw/rxe/rxe_verbs.h |  37 ++++
   9 files changed, 419 insertions(+), 53 deletions(-)
   create mode 100644 drivers/infiniband/sw/rxe/rxe_odp.c






