Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads

On 3 Jan 2025, at 17:09, Yang Shi wrote:

> On Fri, Jan 3, 2025 at 9:24 AM Zi Yan <ziy@xxxxxxxxxx> wrote:
>>
>> Hi all,
>>
>> This patchset accelerates page migration by batching folio copy operations and
>> using multiple CPU threads. It is based on Shivank's Enhancements to Page
>> Migration with Batch Offloading via DMA patchset[1] and my original accelerate
>> page migration patchset[2], and applies on top of mm-everything-2025-01-03-05-59.
>> The last patch is for testing purposes only and should not be considered.
>>
>> The motivations are:
>>
>> 1. Batching folio copies increases copy throughput. Especially for base page
>> migrations, folio copy throughput is low, since kernel activities like
>> moving folio metadata and updating page table entries sit between two folio
>> copies, and base page sizes are relatively small: 4KB on x86_64 and ARM64,
>> or 64KB on ARM64.
>>
>> 2. A single CPU thread has limited copy throughput. Using multiple threads is
>> a natural extension to speed up folio copy when a DMA engine is NOT
>> available in a system.
>>
>>
>> Design
>> ===
>>
>> The design is based on Shivank's patchset and revises MIGRATE_SYNC_NO_COPY
>> (renamed to MIGRATE_NO_COPY) to skip the folio copy operation inside
>> migrate_folio_move() and perform all copies in one shot afterwards. A
>> copy_page_lists_mt() function is added that uses multiple threads to copy
>> folios from the src list to the dst list.
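
To illustrate the idea, a minimal sketch of a per-worker copy routine is below.
This is not the actual copy_page_lists_mt() from the patches; the structure,
the list slicing, and the plain folio_copy() call are simplifying assumptions.

#include <linux/workqueue.h>
#include <linux/mm.h>

/* one work item per worker thread, each owning a slice of both lists */
struct folio_copy_work {
	struct work_struct work;
	struct list_head *dst;	/* slice of the dst folio list */
	struct list_head *src;	/* matching slice of the src folio list */
};

static void folio_copy_work_fn(struct work_struct *work)
{
	struct folio_copy_work *fcw =
		container_of(work, struct folio_copy_work, work);
	struct folio *dst, *src;

	/*
	 * Walk the two slices in lockstep; only folio data is copied here,
	 * folio metadata has already been moved by the caller.
	 */
	dst = list_first_entry(fcw->dst, struct folio, lru);
	list_for_each_entry(src, fcw->src, lru) {
		folio_copy(dst, src);
		dst = list_next_entry(dst, lru);
	}
}

/*
 * The issuer queues one folio_copy_work per worker, e.g. with
 * queue_work(system_unbound_wq, &fcw->work), and waits with flush_work().
 */
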
>>
>> Changes compared to Shivank's patchset (mainly a rewrite of the batching
>> folio copy code)
>> ===
>>
>> 1. mig_info is removed, so no memory allocation is needed during
>> batched folio copies. src->private is used to store the old page state and
>> anon_vma after folio metadata is copied from src to dst.
>>
>> 2. move_to_new_folio() and migrate_folio_move() are refactored to remove
>> redundant code in migrate_folios_batch_move().
>>
>> 3. folio_mc_copy() is used for the single-threaded copy code to keep the
>> original kernel behavior.
>>
>>
>> Performance
>> ===
>>
>> I benchmarked move_pages() throughput on a two-socket NUMA system with two
>> NVIDIA Grace CPUs. The base page size is 64KB. Both 64KB page migration and 2MB
>> mTHP page migration are measured.
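
For context, a move_pages()-based throughput measurement looks roughly like the
user-space sketch below (not the actual benchmark code; the destination node,
the page count, and the omitted timing are placeholders):

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NR_PAGES 1024			/* number of base pages to migrate */

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);	/* 64KB on the system above */
	void *pages[NR_PAGES];
	int nodes[NR_PAGES], status[NR_PAGES];
	char *buf;
	long ret;

	/* allocate and touch the pages so they are actually populated */
	if (posix_memalign((void **)&buf, psz, (size_t)NR_PAGES * psz))
		return 1;
	memset(buf, 1, (size_t)NR_PAGES * psz);

	for (int i = 0; i < NR_PAGES; i++) {
		pages[i] = buf + (long)i * psz;
		nodes[i] = 1;		/* destination NUMA node (placeholder) */
	}

	/* time this call to get GB/s: bytes moved / elapsed seconds */
	ret = move_pages(0, NR_PAGES, pages, nodes, status, MPOL_MF_MOVE);
	if (ret < 0)
		perror("move_pages");

	free(buf);
	return 0;
}

Build with something like "gcc -O2 -o mp_bench mp_bench.c -lnuma".
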
>>
>> The tables below show move_pages() throughput for different
>> configurations and different numbers of copied pages. The columns are the
>> configurations, from the vanilla Linux kernel to 1, 2, 4, 8, 16, and 32
>> threads with this patchset applied. The unit is GB/s.
>>
>> The 32-thread copy throughput can be up to 10x that of single-threaded serial
>> folio copy. Batching folio copies benefits not only huge pages but also base
>> pages.
>>
>> 64KB (GB/s):
>>
>> nr_pages        vanilla mt_1    mt_2    mt_4    mt_8    mt_16   mt_32
>> 32              5.43    4.90    5.65    7.31    7.60    8.61    6.43
>> 256             6.95    6.89    9.28    14.67   22.41   23.39   23.93
>> 512             7.88    7.26    10.15   17.53   27.82   27.88   33.93
>> 768             7.65    7.42    10.46   18.59   28.65   29.67   30.76
>> 1024            7.46    8.01    10.90   17.77   27.04   32.18   38.80
>>
>> 2MB mTHP (GB/s):
>>
>> nr_mthps        vanilla mt_1    mt_2    mt_4    mt_8    mt_16   mt_32
>> 1               5.94    2.90    6.90    8.56    11.16   8.76    6.41
>> 2               7.67    5.57    7.11    12.48   17.37   15.68   14.10
>> 4               8.01    6.04    10.25   20.14   22.52   27.79   25.28
>> 8               8.42    7.00    11.41   24.73   33.96   32.62   39.55
>> 16              9.41    6.91    12.23   27.51   43.95   49.15   51.38
>> 32              10.23   7.15    13.03   29.52   49.49   69.98   71.51
>> 64              9.40    7.37    13.88   30.38   52.00   76.89   79.41
>> 128             8.59    7.23    14.20   28.39   49.98   78.27   90.18
>> 256             8.43    7.16    14.59   28.14   48.78   76.88   92.28
>> 512             8.31    7.78    14.40   26.20   43.31   63.91   75.21
>> 768             8.30    7.86    14.83   27.41   46.25   69.85   81.31
>> 1024            8.31    7.90    14.96   27.62   46.75   71.76   83.84
>
> Is this done on an idle system or a busy system? For real production
> workloads, all the CPUs are likely busy. It would be great to have the
> performance data collected from a busy system too.

Yes, it was done on an idle system.

I redid the experiments on a busy system by running stress on all CPU
cores, and the results are not as good, since all CPUs are occupied.
When I switched to system_highpri_wq, the throughput got better,
almost on par with the results on an idle machine. The numbers are
below.

It becomes a trade-off between page migration throughput and user
application performance on _a busy system_. If page migration is badly
needed, system_highpri_wq can be used to retain high copy throughput.
Otherwise, multiple threads should not be used.
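
For reference, the only difference between the two sets of runs below is which
workqueue the copy work items are queued on, roughly as in this sketch (the
use_highpri knob is hypothetical; only the two workqueues are existing kernel
symbols):

#include <linux/types.h>
#include <linux/workqueue.h>

static void queue_folio_copy_work(struct work_struct *work, bool use_highpri)
{
	/*
	 * system_highpri_wq workers run at elevated priority, so copy
	 * throughput stays close to the idle-machine numbers even when all
	 * CPUs are busy, at the cost of taking time from userspace.
	 * system_unbound_wq workers get starved on a fully loaded system.
	 */
	if (use_highpri)
		queue_work(system_highpri_wq, work);
	else
		queue_work(system_unbound_wq, work);
}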

64KB with system_unbound_wq on a busy system (GB/s):

| ---- | -------- | ---- | ---- | ---- | ---- | ----- | ----- |
|      | vanilla  | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 | mt_32 |
| ---- | -------- | ---- | ---- | ---- | ---- | ----- | ----- |
| 32   | 4.05     | 1.51 | 1.32 | 1.20 | 4.31 | 1.05  | 0.02  |
| 256  | 6.91     | 3.93 | 4.61 | 0.08 | 4.46 | 4.30  | 3.89  |
| 512  | 7.28     | 4.87 | 1.81 | 6.18 | 4.38 | 5.58  | 6.10  |
| 768  | 4.57     | 5.72 | 5.35 | 5.24 | 5.94 | 5.66  | 0.20  |
| 1024 | 7.88     | 5.73 | 5.81 | 6.52 | 7.29 | 6.06  | 5.62  |

2MB with system_unbound_wq on a busy system (GB/s):

| ---- | ------- | ---- | ---- | ---- | ----- | ----- | ----- |
|      | vanilla | mt_1 | mt_2 | mt_4 | mt_8  | mt_16 | mt_32 |
| ---- | ------- | ---- | ---- | ---- | ----- | ----- | ----- |
| 1    | 1.38    | 0.59 | 1.45 | 1.99 | 1.59  | 2.18  | 1.48  |
| 2    | 1.13    | 3.08 | 3.11 | 1.85 | 0.32  | 1.46  | 2.53  |
| 4    | 8.31    | 4.02 | 5.68 | 3.22 | 2.96  | 5.77  | 2.91  |
| 8    | 8.16    | 5.09 | 1.19 | 4.96 | 4.50  | 3.36  | 4.99  |
| 16   | 3.47    | 5.13 | 5.72 | 7.06 | 5.90  | 6.49  | 5.34  |
| 32   | 8.42    | 6.97 | 0.13 | 6.77 | 7.69  | 7.56  | 2.87  |
| 64   | 7.45    | 8.06 | 7.22 | 8.60 | 8.07  | 7.16  | 0.57  |
| 128  | 7.77    | 7.93 | 7.29 | 8.31 | 7.77  | 9.05  | 0.92  |
| 256  | 6.91    | 7.20 | 6.80 | 8.56 | 7.81  | 10.13 | 11.21 |
| 512  | 6.72    | 7.22 | 7.77 | 9.71 | 10.68 | 10.35 | 10.40 |
| 768  | 6.87    | 7.18 | 7.98 | 9.28 | 10.85 | 10.83 | 14.17 |
| 1024 | 6.95    | 7.23 | 8.03 | 9.59 | 10.88 | 10.22 | 20.27 |



64KB with system_highpri_wq on a busy system (GB/s):

| ---- | ------- | ---- | ---- | ----- | ----- | ----- | ----- |
|      | vanilla | mt_1 | mt_2 | mt_4  | mt_8  | mt_16 | mt_32 |
| ---- | ------- | ---- | ---- | ----- | ----- | ----- | ----- |
| 32   | 4.05    | 2.63 | 1.62 | 1.90  | 3.34  | 3.71  | 3.40  |
| 256  | 6.91    | 5.16 | 4.33 | 8.07  | 6.81  | 10.31 | 13.51 |
| 512  | 7.28    | 4.89 | 6.43 | 15.72 | 11.31 | 18.03 | 32.69 |
| 768  | 4.57    | 6.27 | 6.42 | 11.06 | 8.56  | 14.91 | 9.24  |
| 1024 | 7.88    | 6.73 | 0.49 | 17.09 | 19.34 | 23.60 | 18.12 |


2MB with system_highpri_wq on a busy system (GB/s):

| ---- | ------- | ---- | ----- | ----- | ----- | ----- | ----- |
|      | vanilla | mt_1 | mt_2  | mt_4  | mt_8  | mt_16 | mt_32 |
| ---- | ------- | ---- | ----- | ----- | ----- | ----- | ----- |
| 1    | 1.38    | 1.18 | 1.17  | 5.00  | 1.68  | 3.86  | 2.46  |
| 2    | 1.13    | 1.78 | 1.05  | 0.01  | 3.52  | 1.84  | 1.80  |
| 4    | 8.31    | 3.91 | 5.24  | 4.30  | 4.12  | 2.93  | 3.44  |
| 8    | 8.16    | 6.09 | 3.67  | 7.81  | 11.10 | 8.47  | 15.21 |
| 16   | 3.47    | 6.02 | 8.44  | 11.80 | 9.56  | 12.84 | 9.81  |
| 32   | 8.42    | 7.34 | 10.10 | 13.79 | 23.03 | 26.68 | 45.24 |
| 64   | 7.45    | 7.90 | 12.27 | 19.99 | 36.08 | 35.11 | 60.26 |
| 128  | 7.77    | 7.57 | 13.35 | 24.67 | 35.03 | 41.40 | 51.68 |
| 256  | 6.91    | 7.40 | 14.13 | 25.37 | 38.83 | 62.18 | 51.37 |
| 512  | 6.72    | 7.26 | 14.72 | 27.37 | 43.99 | 66.84 | 69.63 |
| 768  | 6.87    | 7.29 | 14.84 | 26.34 | 47.21 | 67.51 | 80.32 |
| 1024 | 6.95    | 7.26 | 14.88 | 26.98 | 47.75 | 74.99 | 85.00 |



>
>>
>>
>> TODOs
>> ===
>> 1. The multi-threaded folio copy routine needs to consult the CPU scheduler and
>> only use idle CPUs to avoid interfering with userspace workloads. Of course,
>> more complicated policies can be used based on the migration-issuing thread's
>> priority.
>
> The other potential problem is it is hard to attribute cpu time
> consumed by the migration work threads to cpu cgroups. In a
> multi-tenant environment this may result in unfair cpu time counting.
> However, it is a chronic problem to properly count cpu time for kernel
> threads. I'm not sure whether it has been solved or not.
>
>>
>> 2. Eliminate memory allocation in the multi-threaded folio copy routine
>> if possible.
>>
>> 3. Add a runtime check to decide when to use multi-threaded folio copy,
>> e.g., based on the cache hotness issue mentioned by Matthew[3].
>>
>> 4. Use non-temporal CPU instructions to avoid cache pollution issues.
>
> AFAICT, arm64 already uses non-temporal instructions for copy page.

Right. My current implementation uses memcpy, which does not use non-temporal
instructions on ARM64, since a huge page can be split and copied by multiple
threads. A non-temporal memcpy could be added for this use case.
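
For example, an ARM64 non-temporal copy loop could look roughly like the sketch
below (an assumption on my side, not part of this series; it copies 64 bytes per
iteration with LDNP/STNP and assumes 16-byte-aligned buffers and a length that
is a multiple of 64):

#include <linux/types.h>

static void copy_nontemporal(void *dst, const void *src, unsigned long len)
{
	u8 *d = dst;
	const u8 *s = src;
	unsigned long i;

	for (i = 0; i < len; i += 64)
		asm volatile(
			"ldnp	x4,  x5,  [%1, #0]\n\t"
			"ldnp	x6,  x7,  [%1, #16]\n\t"
			"ldnp	x8,  x9,  [%1, #32]\n\t"
			"ldnp	x10, x11, [%1, #48]\n\t"
			"stnp	x4,  x5,  [%0, #0]\n\t"
			"stnp	x6,  x7,  [%0, #16]\n\t"
			"stnp	x8,  x9,  [%0, #32]\n\t"
			"stnp	x10, x11, [%0, #48]"
			: /* no outputs */
			: "r"(d + i), "r"(s + i)
			: "x4", "x5", "x6", "x7", "x8",
			  "x9", "x10", "x11", "memory");
}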

Thank you for the inputs.

>
>>
>> 5. Explicitly make multi-threaded folio copy available only on !HIGHMEM
>> configurations, since kmap_local_page() would be needed in each kernel
>> folio copy work thread and would be expensive (a small sketch follows
>> after this list).
>>
>> 6. A better interface than copy_page_lists_mt() to allow DMA data copy
>> to be used as well.
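
On TODO 5 above: each copy worker would have to do something like the sketch
below per page, which is nearly free on !HIGHMEM (kmap_local_page() reduces to
page_address()) but adds real mapping cost on HIGHMEM configurations. A minimal
sketch, not code from this series:

#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/string.h>

static void copy_one_page(struct page *dst_page, struct page *src_page)
{
	/* on HIGHMEM these two mappings are the per-page overhead */
	void *dst = kmap_local_page(dst_page);
	void *src = kmap_local_page(src_page);

	memcpy(dst, src, PAGE_SIZE);

	kunmap_local(src);	/* unmap in reverse mapping order */
	kunmap_local(dst);
}
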
>>
>> Let me know your thoughts. Thanks.
>>
>>
>> [1] https://lore.kernel.org/linux-mm/20240614221525.19170-1-shivankg@xxxxxxx/
>> [2] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@xxxxxxxx/
>> [3] https://lore.kernel.org/linux-mm/Zm0SWZKcRrngCUUW@xxxxxxxxxxxxxxxxxxxx/
>>
>> Byungchul Park (1):
>>   mm: separate move/undo doing on folio list from migrate_pages_batch()
>>
>> Zi Yan (4):
>>   mm/migrate: factor out code in move_to_new_folio() and
>>     migrate_folio_move()
>>   mm/migrate: add migrate_folios_batch_move to batch the folio move
>>     operations
>>   mm/migrate: introduce multi-threaded page copy routine
>>   test: add sysctl for folio copy tests and adjust
>>     NR_MAX_BATCHED_MIGRATION
>>
>>  include/linux/migrate.h      |   3 +
>>  include/linux/migrate_mode.h |   2 +
>>  include/linux/mm.h           |   4 +
>>  include/linux/sysctl.h       |   1 +
>>  kernel/sysctl.c              |  29 ++-
>>  mm/Makefile                  |   2 +-
>>  mm/copy_pages.c              | 190 +++++++++++++++
>>  mm/migrate.c                 | 443 +++++++++++++++++++++++++++--------
>>  8 files changed, 577 insertions(+), 97 deletions(-)
>>  create mode 100644 mm/copy_pages.c
>>
>> --
>> 2.45.2
>>


--
Best Regards,
Yan, Zi



