Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads

On Fri, Jan 03, 2025 at 12:24:14PM -0500, Zi Yan wrote:
> Hi all,

Hi,

I'd appreciate being cc'ed on the next version.

	Byungchul
> 
> This patchset accelerates page migration by batching folio copy operations
> and using multiple CPU threads. It is based on Shivank's "Enhancements to
> Page Migration with Batch Offloading via DMA" patchset[1] and my original
> accelerate-page-migration patchset[2], and applies on top of
> mm-everything-2025-01-03-05-59. The last patch is for testing purposes only
> and should not be considered for merging.
> 
> The motivations are:
> 
> 1. Batching folio copies increases copy throughput. This matters especially
> for base page migrations, where copy throughput is low because kernel
> activities like moving folio metadata and updating page table entries sit
> between consecutive folio copies, and because base pages are relatively
> small: 4KB on x86_64 and ARM64, or 64KB on ARM64 configured with 64KB pages.
> 
> 2. A single CPU thread has limited copy throughput. Using multiple threads
> is a natural way to speed up folio copies when no DMA engine is
> available in the system.
> 
> 
> Design
> ===
> 
> This series builds on Shivank's patchset and revises MIGRATE_SYNC_NO_COPY
> (renamed to MIGRATE_NO_COPY) to skip the folio copy operation inside
> migrate_folio_move() and perform all the copies in one shot afterwards. A
> copy_page_lists_mt() function is added that uses multiple threads to copy
> folios from the src list to the dst list.
> 
> Changes compared to Shivank's patchset (mainly rewrote batching folio
> copy code)
> ===
> 
> 1. mig_info is removed, so no memory allocation is needed while batching
> folio copies. src->private is used to store the old page state and
> anon_vma after folio metadata is copied from src to dst.
> 
> 2. move_to_new_folio() and migrate_folio_move() are refactored to remove
> redundant code in migrate_folios_batch_move().
> 
> 3. folio_mc_copy() is used for the single-threaded copy path to preserve
> the original kernel behavior.
> 
> 
> Performance
> ===
> 
> I benchmarked move_pages() throughput on a two-socket NUMA system with two
> NVIDIA Grace CPUs. The base page size is 64KB. Both 64KB base page
> migration and 2MB mTHP migration are measured.
> 
> The tables below show move_pages() throughput for different
> configurations and different numbers of copied pages. Each column is a
> configuration, from the vanilla Linux kernel to this patchset applied
> with 1, 2, 4, 8, 16, or 32 copy threads; each row is the number of pages
> copied per call. The unit is GB/s.
> 
> The 32-thread copy throughput can be up to 10x that of the single-threaded
> serial folio copy. Batching folio copies benefits not only huge pages but
> also base pages.
> 
> 64KB (GB/s):
> 
> 		vanilla	mt_1	mt_2	mt_4	mt_8	mt_16	mt_32
> 32		5.43	4.90	5.65	7.31	7.60	8.61	6.43
> 256		6.95	6.89	9.28	14.67	22.41	23.39	23.93
> 512		7.88	7.26	10.15	17.53	27.82	27.88	33.93
> 768		7.65	7.42	10.46	18.59	28.65	29.67	30.76
> 1024		7.46	8.01	10.90	17.77	27.04	32.18	38.80
> 
> 2MB mTHP (GB/s):
> 
> 		vanilla	mt_1	mt_2	mt_4	mt_8	mt_16	mt_32
> 1		5.94	2.90	6.90	8.56	11.16	8.76	6.41
> 2		7.67	5.57	7.11	12.48	17.37	15.68	14.10
> 4		8.01	6.04	10.25	20.14	22.52	27.79	25.28
> 8		8.42	7.00	11.41	24.73	33.96	32.62	39.55
> 16		9.41	6.91	12.23	27.51	43.95	49.15	51.38
> 32		10.23	7.15	13.03	29.52	49.49	69.98	71.51
> 64		9.40	7.37	13.88	30.38	52.00	76.89	79.41
> 128		8.59	7.23	14.20	28.39	49.98	78.27	90.18
> 256		8.43	7.16	14.59	28.14	48.78	76.88	92.28
> 512		8.31	7.78	14.40	26.20	43.31	63.91	75.21
> 768		8.30	7.86	14.83	27.41	46.25	69.85	81.31
> 1024		8.31	7.90	14.96	27.62	46.75	71.76	83.84
> 
> 
> TODOs
> ===
> 1. The multi-threaded folio copy routine needs to consult the CPU
> scheduler and use only idle CPUs, to avoid interfering with userspace
> workloads. More sophisticated policies could also take the priority of
> the migration-issuing thread into account.
> 
> 2. Eliminate memory allocation during the multi-threaded folio copy
> routine if possible.
> 
> 3. Add a runtime check to decide when to use the multi-threaded folio
> copy, e.g. based on the cache hotness issue mentioned by Matthew[3].
> 
> 4. Use non-temporal CPU instructions to avoid cache pollution issues.
> 
> 5. Explicitly restrict the multi-threaded folio copy to !HIGHMEM
> configurations, since each folio copy worker thread would need
> kmap_local_page(), which is expensive.
> 
> 6. Design a better interface than copy_page_lists_mt(), so that DMA data
> copy can be used as well.
> 
> Let me know your thoughts. Thanks.
> 
> 
> [1] https://lore.kernel.org/linux-mm/20240614221525.19170-1-shivankg@xxxxxxx/
> [2] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@xxxxxxxx/
> [3] https://lore.kernel.org/linux-mm/Zm0SWZKcRrngCUUW@xxxxxxxxxxxxxxxxxxxx/
> 
> Byungchul Park (1):
>   mm: separate move/undo doing on folio list from migrate_pages_batch()
> 
> Zi Yan (4):
>   mm/migrate: factor out code in move_to_new_folio() and
>     migrate_folio_move()
>   mm/migrate: add migrate_folios_batch_move to batch the folio move
>     operations
>   mm/migrate: introduce multi-threaded page copy routine
>   test: add sysctl for folio copy tests and adjust
>     NR_MAX_BATCHED_MIGRATION
> 
>  include/linux/migrate.h      |   3 +
>  include/linux/migrate_mode.h |   2 +
>  include/linux/mm.h           |   4 +
>  include/linux/sysctl.h       |   1 +
>  kernel/sysctl.c              |  29 ++-
>  mm/Makefile                  |   2 +-
>  mm/copy_pages.c              | 190 +++++++++++++++
>  mm/migrate.c                 | 443 +++++++++++++++++++++++++++--------
>  8 files changed, 577 insertions(+), 97 deletions(-)
>  create mode 100644 mm/copy_pages.c
> 
> -- 
> 2.45.2
> 
