Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads

On 1/9/2025 8:34 PM, Zi Yan wrote:
> On 9 Jan 2025, at 6:47, Shivank Garg wrote:
> 
>> On 1/3/2025 10:54 PM, Zi Yan wrote:
>>


>>>
>>> 6. A better interface than copy_page_lists_mt() to allow DMA data copy
>>> to be used as well.
>>
>> I think Static Calls can be a better option for this.
> 
> This is the first time I have heard about it. Based on the information I found,
> I agree it is a great mechanism for switching between two methods globally.
>>
>> This will give a flexible copy interface that supports both CPU and various DMA-based
>> folio copies. A DMA-capable driver can override the default CPU copy path without any
>> additional runtime overhead.
> 
> Yes, supporting DMA-based folio copy is my intention too. I am happy to work
> with you on that. Things to note are:
> 1. The DMA engine should have higher copy throughput than a single CPU thread,
> otherwise the scatter-gather setup overhead will eliminate the benefit of using the DMA engine.

I agree on this.

> 2. Unless the DMA engine is really beefy and can handle all possible page migration
> requests, CPU-based migration (single or multiple threads) should be the fallback.
> 
> Regarding 2, I wonder how much overhead Static Calls add when switching
> between functions. Also, a lock might be needed, since the fallback to CPU might
> happen per migrate_pages() call. Considering these two points, Static Calls might not work
> as you intended if switching between CPU and DMA is needed.

You can check patches 4/5 and 5/5 for the static call implementation using the DMA driver:
https://lore.kernel.org/linux-mm/20240614221525.19170-5-shivankg@xxxxxxx

There is no run-time overhead with this static call approach, as the update happens only
during DMA driver registration/un-registration (dma_update_migrator()).
SRCU synchronization ensures safety during the updates.

The copy path will use static_call(_folios_copy)(). A wrapper inside the DMA driver can
ensure it falls back to folios_copy(), as sketched below.
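
To make this concrete, here is a minimal sketch of the shape I have in mind.
The list-based folios_copy() signature and the dma_engine_try_copy() /
dma_register_folios_copy() helpers are illustrative placeholders, not the
patchset's final API:

#include <linux/static_call.h>
#include <linux/list.h>
#include <linux/types.h>

/* Default CPU copy path; signature assumed to match the patchset. */
void folios_copy(struct list_head *dst_list, struct list_head *src_list);

/* Placeholder for a DMA driver's submit path; returns true on success. */
bool dma_engine_try_copy(struct list_head *dst_list, struct list_head *src_list);

DEFINE_STATIC_CALL(_folios_copy, folios_copy);

/*
 * Wrapper installed by a DMA-capable driver: try the DMA engine first and
 * fall back to the CPU copy if the engine cannot take the request.
 */
static void dma_folios_copy(struct list_head *dst_list,
			    struct list_head *src_list)
{
	if (dma_engine_try_copy(dst_list, src_list))
		return;
	folios_copy(dst_list, src_list);
}

/* Called from dma_update_migrator() on driver registration. */
void dma_register_folios_copy(void)
{
	static_call_update(_folios_copy, &dma_folios_copy);
}

/* The migration core then always does: */
/*	static_call(_folios_copy)(dst_list, src_list); */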

Does this address your concern regarding point 2?


>> main() {
>> ...
>>
>>     // code snippet to measure throughput
>>     clock_gettime(CLOCK_MONOTONIC, &t1);
>>     retcode = move_pages(getpid(), num_pages, pages, nodesArray, statusArray, MPOL_MF_MOVE);
>>     clock_gettime(CLOCK_MONOTONIC, &t2);
>>
>>     // tput = num_pages*PAGE_SIZE/(t2-t1)
>>
>> ...
>> }
>>
>>
>> Measurements:
>> ============
>> vanilla: base kernel without patchset
>> mt:0 = MT kernel with use_mt_copy=0
>> mt:1..mt:32 = MT kernel with use_mt_copy=1 and thread cnt = 1,2,...,32
>>
>> Measured for both configurations, push_0_pull_1=0 and push_0_pull_1=1, for
>> both 4KB migration and THP migration.
>>
>> --------------------
>> #1 push_0_pull_1 = 0 (src node CPUs are used)
>>
>> #1.1 THP=Never, 4KB (GB/s):
>> nr_pages            vanilla   mt:0      mt:1      mt:2      mt:4      mt:8      mt:16     mt:32
>> 512                 1.28      1.28      1.92      1.80      2.24      2.35      2.22      2.17
>> 4096                2.40      2.40      2.51      2.58      2.83      2.72      2.99      3.25
>> 8192                3.18      2.88      2.83      2.69      3.49      3.46      3.57      3.80
>> 16348               3.17      2.94      2.96      3.17      3.63      3.68      4.06      4.15
>>
>> #1.2 THP=Always, 2MB (GB/s):
>> nr_pages            vanilla   mt:0      mt:1      mt:2      mt:4      mt:8      mt:16     mt:32
>> 512                 4.31      5.02      3.39      3.40      3.33      3.51      3.91      4.03
>> 1024                7.13      4.49      3.58      3.56      3.91      3.87      4.39      4.57
>> 2048                5.26      6.47      3.91      4.00      3.71      3.85      4.97      6.83
>> 4096                9.93      7.77      4.58      3.79      3.93      3.53      6.41      4.77
>> 8192                6.47      6.33      4.37      4.67      4.52      4.39      5.30      5.37
>> 16348               7.66      8.00      5.20      5.22      5.24      5.28      6.41      7.02
>> 32768               8.56      8.62      6.34      6.20      6.20      6.19      7.18      8.10
>> 65536               9.41      9.40      7.14      7.15      7.15      7.19      7.96      8.89
>> 262144              10.17     10.19     7.26      7.90      7.98      8.05      9.46      10.30
>> 524288              10.40     9.95      7.25      7.93      8.02      8.76      9.55      10.30
>>
>> --------------------
>> #2 push_0_pull_1 = 1 (dst node CPUs are used):
>>
>> #2.1 THP=Never 4KB (GB/s):
>> nr_pages            vanilla   mt:0      mt:1      mt:2      mt:4      mt:8      mt:16     mt:32
>> 512                 1.28      1.36      2.01      2.74      2.33      2.31      2.53      2.96
>> 4096                2.40      2.84      2.94      3.04      3.40      3.23      3.31      4.16
>> 8192                3.18      3.27      3.34      3.94      3.77      3.68      4.23      4.76
>> 16348               3.17      3.42      3.66      3.21      3.82      4.40      4.76      4.89
>>
>> #2.2 THP=Always 2MB (GB/s):
>> nr_pages            vanilla   mt:0      mt:1      mt:2      mt:4      mt:8      mt:16     mt:32
>> 512                 4.31      5.91      4.03      3.73      4.26      4.13      4.78      3.44
>> 1024                7.13      6.83      4.60      5.13      5.03      5.19      5.94      7.25
>> 2048                5.26      7.09      5.20      5.69      5.83      5.73      6.85      8.13
>> 4096                9.93      9.31      4.90      4.82      4.82      5.26      8.46      8.52
>> 8192                6.47      7.63      5.66      5.85      5.75      6.14      7.45      8.63
>> 16348               7.66      10.00     6.35      6.54      6.66      6.99      8.18      10.21
>> 32768               8.56      9.78      7.06      7.41      7.76      9.02      9.55      11.92
>> 65536               9.41      10.00     8.19      9.20      9.32      8.68      11.00     13.31
>> 262144              10.17     11.17     9.01      9.96      9.99      10.00     11.70     14.27
>> 524288              10.40     11.38     9.07      9.98      10.01     10.09     11.95     14.48
>>
>> Note:
>> 1. For THP=Never: I'm migrating 16x as many pages to keep the total size the
>>    same as your experiment with 64KB page size.
>> 2. For THP=Always: nr_pages = number of 4KB pages moved
>>    (nr_pages=512 => 512 4KB pages => one 2MB page).
>>
>>
>> I'm seeing little (1.5x in some cases) to no benefit. The performance scaling is
>> relatively flat across thread counts.
>>
>> Is it possible I'm missing something in my testing?
>>
>> Could the base page size difference (4KB vs 64KB) be playing a role in the
>> scaling behavior? How does the performance vary with 4KB pages on your system?
>>
>> I'd be happy to work with you on investigating these differences.
>> Let me know if you'd like any additional test data or if there are specific
>> configurations I should try.
> 
> The results surprise me, since I was able to achieve ~9GB/s when migrating
> 16 2MB THPs with 16 threads on a two-socket system with Xeon E5-2650 v3 @ 2.30GHz
> (a 19.2GB/s bandwidth QPI link between the two sockets) back in 2019 [1].
> Those are 10-year-old Haswell CPUs, yet your results above show that EPYC 5 can
> only achieve ~4GB/s when migrating 512 2MB THPs with 16 threads. It just does
> not make sense.
> 
> One thing you might want to try is setting init_on_alloc=0 in your boot
> parameters so that pages are zeroed with folio_zero_user() instead of GFP_ZERO.
> That might reduce the time spent on page zeroing.
> 
> I am also going to rerun the experiments locally on x86_64 boxes to see if your
> results can be replicated.
> 
> Thank you for the review and for running these experiments. I really appreciate
> it.
> 
> [1] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@xxxxxxxx/
> 

Using init_on_alloc=0 gave a significant performance gain over the last experiment,
but I'm still not seeing the performance scaling you observed.

THP=Never, 4KB (GB/s):
nr_pages            vanilla   mt:0      mt:1      mt:2      mt:4      mt:8      mt:16     mt:32
512                 1.40      1.43      2.79      3.48      3.63      3.73      3.63      3.57
4096                2.54      3.32      3.18      4.65      4.83      5.11      5.39      5.78
8192                3.35      4.40      4.39      4.71      3.63      5.04      5.33      6.00
16348               3.76      4.50      4.44      5.33      5.41      5.41      6.47      6.41

THP=Always, 2MB (GB/s):
nr_pages            vanilla   mt:0      mt:1      mt:2      mt:4      mt:8      mt:16     mt:32
512                 5.21      5.47      5.77      6.92      3.71      2.75      7.54      7.44
1024                6.10      7.65      8.12      8.41      8.87      8.55      9.13      11.36
2048                6.39      6.66      9.58      8.92      10.75     12.99     13.33     12.23
4096                7.33      10.85     8.22      13.57     11.43     10.93     12.53     16.86
8192                7.26      7.46      8.88      11.82     10.55     10.94     13.27     14.11
16348               9.07      8.53      11.82     14.89     12.97     13.22     16.14     18.10
32768               10.45     10.55     11.79     19.19     16.85     17.56     20.58     26.57
65536               11.00     11.12     13.25     18.27     16.18     16.11     19.61     27.73
262144              12.37     12.40     15.65     20.00     19.25     19.38     22.60     31.95
524288              12.44     12.33     15.66     19.78     19.06     18.96     23.31     32.29
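
For reference, the measurement loop quoted above expands to roughly the
following self-contained program (node number, page count, buffer setup and
error handling are illustrative; build with -lnuma). Throughput is computed
as num_pages*PAGE_SIZE/(t2-t1), as in the comment above:

#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned long num_pages = 4096;		/* matches one row above */
	int dst_node = 1;			/* illustrative destination node */
	struct timespec t1, t2;
	void **pages;
	int *nodes, *status;
	unsigned long i;
	char *buf;

	buf = mmap(NULL, num_pages * page_size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;
	memset(buf, 1, num_pages * page_size);	/* fault pages in on the source node */

	pages = malloc(num_pages * sizeof(*pages));
	nodes = malloc(num_pages * sizeof(*nodes));
	status = malloc(num_pages * sizeof(*status));
	for (i = 0; i < num_pages; i++) {
		pages[i] = buf + i * page_size;
		nodes[i] = dst_node;
	}

	clock_gettime(CLOCK_MONOTONIC, &t1);
	if (move_pages(getpid(), num_pages, pages, nodes, status, MPOL_MF_MOVE) < 0)
		perror("move_pages");
	clock_gettime(CLOCK_MONOTONIC, &t2);

	double secs = (t2.tv_sec - t1.tv_sec) + (t2.tv_nsec - t1.tv_nsec) / 1e9;
	printf("throughput: %.2f GB/s\n", num_pages * page_size / secs / 1e9);
	return 0;
}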

Thanks,
Shivank

> Best Regards,
> Yan, Zi
> 




