On Fri, Jan 3, 2025 at 9:24 AM Zi Yan <ziy@xxxxxxxxxx> wrote:
>
> Hi all,
>
> This patchset accelerates page migration by batching folio copy operations
> and using multiple CPU threads. It is based on Shivank's "Enhancements to
> Page Migration with Batch Offloading via DMA" patchset[1] and my original
> accelerated page migration patchset[2], and it applies on top of
> mm-everything-2025-01-03-05-59. The last patch is for testing purposes
> and should not be considered for merging.
>
> The motivations are:
>
> 1. Batching folio copies increases copy throughput. Folio copy throughput
> is low especially for base page migrations, since kernel activities like
> moving folio metadata and updating page table entries sit between two
> folio copies, and base page sizes are relatively small: 4KB on x86_64 and
> ARM64, or 64KB on ARM64 configured with 64KB base pages.
>
> 2. A single CPU thread has limited copy throughput. Using multiple
> threads is a natural extension to speed up folio copy when no DMA engine
> is available in a system.
>
>
> Design
> ===
>
> The patchset builds on Shivank's work and revises MIGRATE_SYNC_NO_COPY
> (renamed to MIGRATE_NO_COPY) to skip the folio copy operation inside
> migrate_folio_move() and perform all copies in one shot afterwards. A
> copy_page_lists_mt() function is added that uses multiple threads to copy
> folios from the src list to the dst list.
>
> Changes compared to Shivank's patchset (mainly a rewrite of the batched
> folio copy code)
> ===
>
> 1. mig_info is removed, so no memory allocation is needed during batched
> folio copies. src->private is used to store the old page state and
> anon_vma after folio metadata is copied from src to dst.
>
> 2. move_to_new_folio() and migrate_folio_move() are refactored to remove
> redundant code in migrate_folios_batch_move().
>
> 3. folio_mc_copy() is used in the single-threaded copy code to keep the
> original kernel behavior.
>
>
> Performance
> ===
>
> I benchmarked move_pages() throughput on a two-socket NUMA system with
> two NVIDIA Grace CPUs. The base page size is 64KB. Both 64KB page
> migration and 2MB mTHP migration are measured.
>
> The tables below show move_pages() throughput in GB/s. The columns are
> the configurations, from the vanilla Linux kernel to 1, 2, 4, 8, 16, and
> 32 copy threads with this patchset applied; the rows are the numbers of
> pages moved per call.
>
> The 32-thread copy throughput can be up to 10x that of single-threaded
> serial folio copy. Batching folio copies benefits not only huge pages but
> also base pages.
>
> 64KB (GB/s):
>
> nr_pages  vanilla   mt_1   mt_2   mt_4   mt_8  mt_16  mt_32
>       32     5.43   4.90   5.65   7.31   7.60   8.61   6.43
>      256     6.95   6.89   9.28  14.67  22.41  23.39  23.93
>      512     7.88   7.26  10.15  17.53  27.82  27.88  33.93
>      768     7.65   7.42  10.46  18.59  28.65  29.67  30.76
>     1024     7.46   8.01  10.90  17.77  27.04  32.18  38.80
>
> 2MB mTHP (GB/s):
>
> nr_pages  vanilla   mt_1   mt_2   mt_4   mt_8  mt_16  mt_32
>        1     5.94   2.90   6.90   8.56  11.16   8.76   6.41
>        2     7.67   5.57   7.11  12.48  17.37  15.68  14.10
>        4     8.01   6.04  10.25  20.14  22.52  27.79  25.28
>        8     8.42   7.00  11.41  24.73  33.96  32.62  39.55
>       16     9.41   6.91  12.23  27.51  43.95  49.15  51.38
>       32    10.23   7.15  13.03  29.52  49.49  69.98  71.51
>       64     9.40   7.37  13.88  30.38  52.00  76.89  79.41
>      128     8.59   7.23  14.20  28.39  49.98  78.27  90.18
>      256     8.43   7.16  14.59  28.14  48.78  76.88  92.28
>      512     8.31   7.78  14.40  26.20  43.31  63.91  75.21
>      768     8.30   7.86  14.83  27.41  46.25  69.85  81.31
>     1024     8.31   7.90  14.96  27.62  46.75  71.76  83.84

Is this done on an idle system or a busy system? For real production
workloads, all the CPUs are likely busy. It would be great to have the
performance data collected from a busy system too.
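To make the discussion concrete: the Design section above has
copy_page_lists_mt() handing one batched copy to several kernel threads.
A minimal sketch of that idea, assuming the src and dst folios are linked
one-to-one via folio->lru (copy_work and folio_copy_workfn are made-up
names for illustration, not the identifiers used in the patches):

	/*
	 * Each worker copies a contiguous chunk of the batched folio
	 * list; a dispatcher would queue one copy_work per thread via
	 * queue_work() and flush the workqueue before proceeding.
	 */
	struct copy_work {
		struct work_struct work;
		struct list_head *src;	/* first src folio for this worker */
		struct list_head *dst;	/* matching dst folio */
		int nr;			/* number of folios to copy */
	};

	static void folio_copy_workfn(struct work_struct *w)
	{
		struct copy_work *cw = container_of(w, struct copy_work, work);
		struct list_head *s = cw->src, *d = cw->dst;
		int i;

		for (i = 0; i < cw->nr; i++) {
			folio_copy(list_entry(d, struct folio, lru),
				   list_entry(s, struct folio, lru));
			s = s->next;
			d = d->next;
		}
	}

However it is implemented, each worker competes with userspace for a CPU
on a busy machine, which is why the busy-system numbers matter.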
>
>
> TODOs
> ===
>
> 1. The multi-threaded folio copy routine needs to look at the CPU
> scheduler and only use idle CPUs to avoid interfering with userspace
> workloads. Of course, more complicated policies can be applied based on
> the priority of the thread issuing the migration.

The other potential problem is that it is hard to attribute the CPU time
consumed by the migration worker threads to CPU cgroups. In a multi-tenant
environment this may result in unfair CPU time accounting. However,
properly accounting the CPU time of kernel threads is a chronic problem,
and I'm not sure whether it has been solved.

> 2. Eliminate memory allocation during the multi-threaded folio copy
> routine if possible.
>
> 3. A runtime check to decide when to use multi-threaded folio copy,
> something like the cache hotness issue mentioned by Matthew[3].
>
> 4. Use non-temporal CPU instructions to avoid cache pollution issues.

AFAICT, arm64 already uses non-temporal instructions for copy_page().

> 5. Explicitly make multi-threaded folio copy available only on !HIGHMEM
> configurations, since kmap_local_page() would be needed in each kernel
> folio copy worker thread and is expensive.
>
> 6. A better interface than copy_page_lists_mt() to allow DMA data copy
> to be used as well.
>
> Let me know your thoughts. Thanks.
>
>
> [1] https://lore.kernel.org/linux-mm/20240614221525.19170-1-shivankg@xxxxxxx/
> [2] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@xxxxxxxx/
> [3] https://lore.kernel.org/linux-mm/Zm0SWZKcRrngCUUW@xxxxxxxxxxxxxxxxxxxx/
>
> Byungchul Park (1):
>   mm: separate move/undo doing on folio list from migrate_pages_batch()
>
> Zi Yan (4):
>   mm/migrate: factor out code in move_to_new_folio() and
>     migrate_folio_move()
>   mm/migrate: add migrate_folios_batch_move to batch the folio move
>     operations
>   mm/migrate: introduce multi-threaded page copy routine
>   test: add sysctl for folio copy tests and adjust
>     NR_MAX_BATCHED_MIGRATION
>
>  include/linux/migrate.h      |   3 +
>  include/linux/migrate_mode.h |   2 +
>  include/linux/mm.h           |   4 +
>  include/linux/sysctl.h       |   1 +
>  kernel/sysctl.c              |  29 ++-
>  mm/Makefile                  |   2 +-
>  mm/copy_pages.c              | 190 +++++++++++++++
>  mm/migrate.c                 | 443 +++++++++++++++++++++++++++--------
>  8 files changed, 577 insertions(+), 97 deletions(-)
>  create mode 100644 mm/copy_pages.c
>
> --
> 2.45.2
>
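One more thought on TODO 1: the "only use idle CPUs" policy could start as
simply as clamping the worker count to the number of currently idle CPUs
when the copy is issued. A rough sketch of that policy direction, only to
illustrate the idea (pick_copy_threads() is a made-up name, not code from
the patchset):

	static unsigned int pick_copy_threads(unsigned int requested)
	{
		unsigned int idle = 0;
		int cpu;

		/* count CPUs the scheduler currently considers idle */
		for_each_online_cpu(cpu)
			if (idle_cpu(cpu))
				idle++;

		/* keep at least one thread so migration still progresses */
		return min(requested, max(idle, 1U));
	}

This is racy by nature (a CPU can stop being idle right after the check)
and it does not address the cgroup accounting question above, but it would
at least bound the interference with running workloads.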