On 9 Jan 2025, at 6:47, Shivank Garg wrote:

> On 1/3/2025 10:54 PM, Zi Yan wrote:
>
> Hi Zi,
>
> It's interesting to see my batch page migration patchset evolve with
> multi-threading support. Thanks for sharing this.
>
>> Hi all,
>>
>> This patchset accelerates page migration by batching folio copy operations
>> and using multiple CPU threads. It is based on Shivank's "Enhancements to
>> Page Migration with Batch Offloading via DMA" patchset[1] and my original
>> accelerate page migration patchset[2], and is on top of
>> mm-everything-2025-01-03-05-59. The last patch is for testing purposes and
>> should not be considered.
>>
>> The motivations are:
>>
>> 1. Batching folio copies increases copy throughput. Folio copy throughput
>> is low especially for base page migrations, since kernel activities like
>> moving folio metadata and updating page table entries sit between two
>> folio copies, and base page sizes are relatively small: 4KB on x86_64 and
>> ARM64, or 64KB on ARM64.
>>
>> 2. A single CPU thread has limited copy throughput. Using multiple threads
>> is a natural extension to speed up folio copy when a DMA engine is NOT
>> available in the system.
>>
>>
>> Design
>> ===
>>
>> This patchset is based on Shivank's patchset and revises
>> MIGRATE_SYNC_NO_COPY (renamed to MIGRATE_NO_COPY) to avoid the folio copy
>> operation inside migrate_folio_move() and perform the copies in one shot
>> afterwards. A copy_page_lists_mt() function is added to use multiple
>> threads to copy folios from the src list to the dst list.
>>
>> Changes compared to Shivank's patchset (mainly a rewrite of the batching
>> folio copy code)
>> ===
>>
>> 1. mig_info is removed, so no memory allocation is needed during batched
>> folio copies. src->private is used to store the old page state and
>> anon_vma after folio metadata is copied from src to dst.
>>
>> 2. move_to_new_folio() and migrate_folio_move() are refactored to remove
>> redundant code in migrate_folios_batch_move().
>>
>> 3. folio_mc_copy() is used for the single-threaded copy code to keep the
>> original kernel behavior.
>>
>>
>> TODOs
>> ===
>>
>> 1. The multi-threaded folio copy routine needs to look at the CPU
>> scheduler and only use idle CPUs to avoid interfering with userspace
>> workloads. Of course, more complicated policies can be used based on the
>> priority of the migration-issuing thread.
>>
>> 2. Eliminate memory allocation during the multi-threaded folio copy
>> routine if possible.
>>
>> 3. A runtime check to decide when to use multi-threaded folio copy.
>> Something like the cache hotness issue mentioned by Matthew[3].
>>
>> 4. Use non-temporal CPU instructions to avoid cache pollution issues.
>>
>> 5. Explicitly make multi-threaded folio copy available only to !HIGHMEM,
>> since kmap_local_page() would be needed in each kernel folio copy worker
>> thread and is expensive.
>>
>> 6. A better interface than copy_page_lists_mt() to allow DMA data copy to
>> be used as well.
>
> I think Static Calls can be a better option for this.

This is the first time I have heard about it. Based on the info I could find,
I agree it is a great mechanism for switching between two methods globally.

> This will give a flexible copy interface to support both CPU and various
> DMA-based folio copies. A DMA-capable driver can override the default CPU
> copy path without any additional runtime overhead.

Yes, supporting DMA-based folio copy is my intention too. I am happy to work
with you on that. Things to note are:

1. The DMA engine should have more copy throughput than a single CPU thread,
otherwise the scatter-gather setup overheads will eliminate the benefit of
using the DMA engine.

2. Unless the DMA engine is really beefy and can handle all possible page
migration requests, CPU-based migration (single or multiple threads) should
be a fallback.

Regarding 2, I wonder how much overhead Static Calls incur when switching
between functions. Also, a lock might be needed, since falling back to CPU
might happen per migrate_pages() call. Considering these two, Static Calls
might not work as you intended if switching between CPU and DMA is needed.
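To make the discussion concrete, below is a rough sketch of what a
static_call-based copy interface could look like. All function and key names
are made up for illustration; they are not from either patchset, and the
batch-copy signature is just an assumption:

#include <linux/static_call.h>
#include <linux/list.h>

/*
 * Hypothetical default target: the CPU copy path (single- or
 * multi-threaded) provided by this patchset.
 */
static int folio_copy_batch_cpu(struct list_head *dst_list,
				struct list_head *src_list, int nr_folios)
{
	/* ... CPU-based batched folio copy ... */
	return 0;
}

DEFINE_STATIC_CALL(folio_copy_batch, folio_copy_batch_cpu);

/* Call site in the batched migration path: compiles to a direct call. */
static int do_batch_copy(struct list_head *dst_list,
			 struct list_head *src_list, int nr_folios)
{
	return static_call(folio_copy_batch)(dst_list, src_list, nr_folios);
}

/* A DMA-capable driver would retarget the call once, e.g. at probe time. */
void folio_copy_register_dma(int (*dma_copy)(struct list_head *,
					     struct list_head *, int))
{
	static_call_update(folio_copy_batch, dma_copy);
}

Note that static_call_update() patches the call site globally, so a
per-migrate_pages() fallback from DMA to CPU would probably still need an
ordinary function pointer or a capability check next to the static call.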
>> Performance
>> ===
>>
>> I benchmarked move_pages() throughput on a two-socket NUMA system with two
>> NVIDIA Grace CPUs. The base page size is 64KB. Both 64KB page migration and
>> 2MB mTHP page migration are measured.
>>
>> The tables below show move_pages() throughput with different
>> configurations and different numbers of copied pages. The columns are the
>> configurations, from the vanilla Linux kernel to using 1, 2, 4, 8, 16, 32
>> threads with this patchset applied. The unit is GB/s.
>>
>> The 32-thread copy throughput can be up to 10x that of the single-threaded
>> serial folio copy. Batching folio copies benefits not only huge pages but
>> also base pages.
>>
>> 64KB (GB/s):
>>
>> nr_pages  vanilla  mt_1   mt_2   mt_4   mt_8   mt_16  mt_32
>> 32        5.43     4.90   5.65   7.31   7.60   8.61   6.43
>> 256       6.95     6.89   9.28   14.67  22.41  23.39  23.93
>> 512       7.88     7.26   10.15  17.53  27.82  27.88  33.93
>> 768       7.65     7.42   10.46  18.59  28.65  29.67  30.76
>> 1024      7.46     8.01   10.90  17.77  27.04  32.18  38.80
>>
>> 2MB mTHP (GB/s):
>>
>> nr_pages  vanilla  mt_1   mt_2   mt_4   mt_8   mt_16  mt_32
>> 1         5.94     2.90   6.90   8.56   11.16  8.76   6.41
>> 2         7.67     5.57   7.11   12.48  17.37  15.68  14.10
>> 4         8.01     6.04   10.25  20.14  22.52  27.79  25.28
>> 8         8.42     7.00   11.41  24.73  33.96  32.62  39.55
>> 16        9.41     6.91   12.23  27.51  43.95  49.15  51.38
>> 32        10.23    7.15   13.03  29.52  49.49  69.98  71.51
>> 64        9.40     7.37   13.88  30.38  52.00  76.89  79.41
>> 128       8.59     7.23   14.20  28.39  49.98  78.27  90.18
>> 256       8.43     7.16   14.59  28.14  48.78  76.88  92.28
>> 512       8.31     7.78   14.40  26.20  43.31  63.91  75.21
>> 768       8.30     7.86   14.83  27.41  46.25  69.85  81.31
>> 1024      8.31     7.90   14.96  27.62  46.75  71.76  83.84
>
> I'm measuring the throughput (in GB/s) on our AMD EPYC Zen 5 system
> (2-socket, 64 cores per socket with SMT enabled, 2 NUMA nodes) with a 4KB
> base page size, using mm-everything-2025-01-04-04-41 as the base kernel.
>
> Method:
> ======
> main() {
> ...
>
>     // code snippet to measure throughput
>     clock_gettime(CLOCK_MONOTONIC, &t1);
>     retcode = move_pages(getpid(), num_pages, pages, nodesArray, statusArray, MPOL_MF_MOVE);
>     clock_gettime(CLOCK_MONOTONIC, &t2);
>
>     // tput = num_pages * PAGE_SIZE / (t2 - t1)
>
> ...
> }
>
> Measurements:
> ============
> vanilla:      base kernel without the patchset
> mt:0:         MT kernel with use_mt_copy=0
> mt:1..mt:32:  MT kernel with use_mt_copy=1 and thread cnt = 1,2,...,32
>
> Measured for both configurations, push_0_pull_1=0 and push_0_pull_1=1, and
> for both 4KB and THP migration.
>
> --------------------
> #1 push_0_pull_1 = 0 (src node CPUs are used)
>
> #1.1 THP=Never, 4KB (GB/s):
> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
> 512       1.28     1.28   1.92   1.80   2.24   2.35   2.22   2.17
> 4096      2.40     2.40   2.51   2.58   2.83   2.72   2.99   3.25
> 8192      3.18     2.88   2.83   2.69   3.49   3.46   3.57   3.80
> 16348     3.17     2.94   2.96   3.17   3.63   3.68   4.06   4.15
>
> #1.2 THP=Always, 2MB (GB/s):
> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
> 512       4.31     5.02   3.39   3.40   3.33   3.51   3.91   4.03
> 1024      7.13     4.49   3.58   3.56   3.91   3.87   4.39   4.57
> 2048      5.26     6.47   3.91   4.00   3.71   3.85   4.97   6.83
> 4096      9.93     7.77   4.58   3.79   3.93   3.53   6.41   4.77
> 8192      6.47     6.33   4.37   4.67   4.52   4.39   5.30   5.37
> 16348     7.66     8.00   5.20   5.22   5.24   5.28   6.41   7.02
> 32768     8.56     8.62   6.34   6.20   6.20   6.19   7.18   8.10
> 65536     9.41     9.40   7.14   7.15   7.15   7.19   7.96   8.89
> 262144    10.17    10.19  7.26   7.90   7.98   8.05   9.46   10.30
> 524288    10.40    9.95   7.25   7.93   8.02   8.76   9.55   10.30
>
> --------------------
> #2 push_0_pull_1 = 1 (dst node CPUs are used)
>
> #2.1 THP=Never, 4KB (GB/s):
> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
> 512       1.28     1.36   2.01   2.74   2.33   2.31   2.53   2.96
> 4096      2.40     2.84   2.94   3.04   3.40   3.23   3.31   4.16
> 8192      3.18     3.27   3.34   3.94   3.77   3.68   4.23   4.76
> 16348     3.17     3.42   3.66   3.21   3.82   4.40   4.76   4.89
>
> #2.2 THP=Always, 2MB (GB/s):
> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
> 512       4.31     5.91   4.03   3.73   4.26   4.13   4.78   3.44
> 1024      7.13     6.83   4.60   5.13   5.03   5.19   5.94   7.25
> 2048      5.26     7.09   5.20   5.69   5.83   5.73   6.85   8.13
> 4096      9.93     9.31   4.90   4.82   4.82   5.26   8.46   8.52
> 8192      6.47     7.63   5.66   5.85   5.75   6.14   7.45   8.63
> 16348     7.66     10.00  6.35   6.54   6.66   6.99   8.18   10.21
> 32768     8.56     9.78   7.06   7.41   7.76   9.02   9.55   11.92
> 65536     9.41     10.00  8.19   9.20   9.32   8.68   11.00  13.31
> 262144    10.17    11.17  9.01   9.96   9.99   10.00  11.70  14.27
> 524288    10.40    11.38  9.07   9.98   10.01  10.09  11.95  14.48
>
> Note:
> 1. For THP=Never, I'm using 16X the number of pages to keep the total size
> the same as in your experiment with 64KB page size.
> 2. For THP=Always, nr_pages = number of 4KB pages moved
> (nr_pages=512 => 512 4KB pages => 1 2MB page).
>
> I'm seeing little (1.5X in some cases) to no benefit. The performance
> scaling is relatively flat across thread counts.
>
> Is it possible I'm missing something in my testing?
>
> Could the base page size difference (4KB vs 64KB) be playing a role in the
> scaling behavior? How does the performance vary with 4KB pages on your
> system?
>
> I'd be happy to work with you on investigating these differences. Let me
> know if you'd like any additional test data or if there are specific
> configurations I should try.
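Your measurement method looks conceptually the same as mine. For reference,
a minimal self-contained harness along those lines is sketched below; the
mmap-based allocation and the array setup are my assumptions, not your exact
code (build with -lnuma):

#include <numaif.h>      /* move_pages(), MPOL_MF_MOVE */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	long num_pages = argc > 1 ? atol(argv[1]) : 512;
	int dst_node = argc > 2 ? atoi(argv[2]) : 1;
	long page_size = sysconf(_SC_PAGESIZE);
	struct timespec t1, t2;

	/* anonymous buffer; touch it so the pages are actually allocated */
	char *buf = mmap(NULL, num_pages * page_size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 1, num_pages * page_size);

	void **pages = malloc(num_pages * sizeof(void *));
	int *nodes = malloc(num_pages * sizeof(int));
	int *status = malloc(num_pages * sizeof(int));
	for (long i = 0; i < num_pages; i++) {
		pages[i] = buf + i * page_size;
		nodes[i] = dst_node;	/* migrate everything to dst_node */
	}

	clock_gettime(CLOCK_MONOTONIC, &t1);
	long ret = move_pages(getpid(), num_pages, pages, nodes, status,
			      MPOL_MF_MOVE);
	clock_gettime(CLOCK_MONOTONIC, &t2);
	if (ret < 0)
		perror("move_pages");

	double secs = (t2.tv_sec - t1.tv_sec) +
		      (t2.tv_nsec - t1.tv_nsec) / 1e9;
	printf("moved %ld pages in %.6f s: %.2f GB/s\n",
	       num_pages, secs, num_pages * page_size / secs / 1e9);
	return 0;
}

The buffer's first touch can be pinned to the source node with something
like "numactl --membind=0 ./move_tput 4096 1" so the migration direction is
well defined.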
The results surprise me, since I was able to achieve ~9GB/s when migrating
16 2MB THPs with 16 threads on a two-socket system with Xeon E5-2650 v3
@ 2.30GHz (a 19.2GB/s bandwidth QPI link between the two sockets) back in
2019[1]. Those are 10-year-old Haswell CPUs, yet your results above show
that EPYC Zen 5 can only achieve ~4GB/s when migrating 512 2MB THPs with 16
threads. It just does not make sense.

One thing you might want to try is to set init_on_alloc=0 in your boot
parameters, so that folio_zero_user() is used instead of GFP_ZERO to zero
pages. That might reduce the time spent on page zeroing.

I am also going to rerun the experiments locally on x86_64 boxes to see if
your results can be replicated.

Thank you for the review and for running these experiments. I really
appreciate it.

[1] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@xxxxxxxx/

Best Regards,
Yan, Zi