On 10 Jan 2025, at 12:05, Zi Yan wrote:

> <snip>
>>>
>>>>> main() {
>>>>>         ...
>>>>>
>>>>>         // code snippet to measure throughput
>>>>>         clock_gettime(CLOCK_MONOTONIC, &t1);
>>>>>         retcode = move_pages(getpid(), num_pages, pages, nodesArray, statusArray, MPOL_MF_MOVE);
>>>>>         clock_gettime(CLOCK_MONOTONIC, &t2);
>>>>>
>>>>>         // tput = num_pages*PAGE_SIZE/(t2-t1)
>>>>>
>>>>>         ...
>>>>> }
>>>>>
>>>>>
>>>>> Measurements:
>>>>> ============
>>>>> vanilla: base kernel without the patchset
>>>>> mt:0 = MT kernel with use_mt_copy=0
>>>>> mt:1..mt:32 = MT kernel with use_mt_copy=1 and thread cnt = 1,2,...,32
>>>>>
>>>>> Measured for both configurations, push_0_pull_1=0 and push_0_pull_1=1, and
>>>>> for both 4KB migration and THP migration.
>>>>>
>>>>> --------------------
>>>>> #1 push_0_pull_1 = 0 (src node CPUs are used)
>>>>>
>>>>> #1.1 THP=Never, 4KB (GB/s):
>>>>> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
>>>>> 512       1.28     1.28   1.92   1.80   2.24   2.35   2.22   2.17
>>>>> 4096      2.40     2.40   2.51   2.58   2.83   2.72   2.99   3.25
>>>>> 8192      3.18     2.88   2.83   2.69   3.49   3.46   3.57   3.80
>>>>> 16348     3.17     2.94   2.96   3.17   3.63   3.68   4.06   4.15
>>>>>
>>>>> #1.2 THP=Always, 2MB (GB/s):
>>>>> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
>>>>> 512       4.31     5.02   3.39   3.40   3.33   3.51   3.91   4.03
>>>>> 1024      7.13     4.49   3.58   3.56   3.91   3.87   4.39   4.57
>>>>> 2048      5.26     6.47   3.91   4.00   3.71   3.85   4.97   6.83
>>>>> 4096      9.93     7.77   4.58   3.79   3.93   3.53   6.41   4.77
>>>>> 8192      6.47     6.33   4.37   4.67   4.52   4.39   5.30   5.37
>>>>> 16348     7.66     8.00   5.20   5.22   5.24   5.28   6.41   7.02
>>>>> 32768     8.56     8.62   6.34   6.20   6.20   6.19   7.18   8.10
>>>>> 65536     9.41     9.40   7.14   7.15   7.15   7.19   7.96   8.89
>>>>> 262144    10.17    10.19  7.26   7.90   7.98   8.05   9.46   10.30
>>>>> 524288    10.40    9.95   7.25   7.93   8.02   8.76   9.55   10.30
>>>>>
>>>>> --------------------
>>>>> #2 push_0_pull_1 = 1 (dst node CPUs are used)
>>>>>
>>>>> #2.1 THP=Never, 4KB (GB/s):
>>>>> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
>>>>> 512       1.28     1.36   2.01   2.74   2.33   2.31   2.53   2.96
>>>>> 4096      2.40     2.84   2.94   3.04   3.40   3.23   3.31   4.16
>>>>> 8192      3.18     3.27   3.34   3.94   3.77   3.68   4.23   4.76
>>>>> 16348     3.17     3.42   3.66   3.21   3.82   4.40   4.76   4.89
>>>>>
>>>>> #2.2 THP=Always, 2MB (GB/s):
>>>>> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
>>>>> 512       4.31     5.91   4.03   3.73   4.26   4.13   4.78   3.44
>>>>> 1024      7.13     6.83   4.60   5.13   5.03   5.19   5.94   7.25
>>>>> 2048      5.26     7.09   5.20   5.69   5.83   5.73   6.85   8.13
>>>>> 4096      9.93     9.31   4.90   4.82   4.82   5.26   8.46   8.52
>>>>> 8192      6.47     7.63   5.66   5.85   5.75   6.14   7.45   8.63
>>>>> 16348     7.66     10.00  6.35   6.54   6.66   6.99   8.18   10.21
>>>>> 32768     8.56     9.78   7.06   7.41   7.76   9.02   9.55   11.92
>>>>> 65536     9.41     10.00  8.19   9.20   9.32   8.68   11.00  13.31
>>>>> 262144    10.17    11.17  9.01   9.96   9.99   10.00  11.70  14.27
>>>>> 524288    10.40    11.38  9.07   9.98   10.01  10.09  11.95  14.48
>>>>>
>>>>> Notes:
>>>>> 1. For THP = Never: I'm using 16X as many pages to keep the total size the
>>>>>    same as in your experiment with the 64KB page size.
>>>>> 2. For THP = Always: nr_pages = number of 4KB pages moved
>>>>>    (nr_pages=512 => 512 4KB pages => one 2MB page).
>>>>>
>>>>> I'm seeing little (1.5X in some cases) to no benefit. The performance scaling is
>>>>> relatively flat across thread counts.
>>>>>
>>>>> Is it possible I'm missing something in my testing?
>>>>>
>>>>> Could the base page size difference (4KB vs 64KB) be playing a role in
>>>>> the scaling behavior? How does the performance vary with 4KB pages on your system?
>>>>>
>>>>> I'd be happy to work with you on investigating these differences.
>>>>> Let me know if you'd like any additional test data or if there are specific
>>>>> configurations I should try.
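
As an aside, for anyone who wants to reproduce the numbers above: the
measurement loop quoted at the top can be fleshed out into a standalone test
roughly as below. This is only a sketch of how I read it; the node pair 0 -> 1,
the libnuma allocation, and plain 4KB anonymous pages are assumptions on my
part, not necessarily what the original harness does. Build with
"gcc -O2 -o migrate_bench migrate_bench.c -lnuma".

/*
 * migrate_bench.c: rough standalone version of the measurement loop
 * quoted above. Source node 0, destination node 1 and the libnuma
 * allocation are assumptions, not taken from the original harness.
 */
#include <numa.h>
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	long num_pages = argc > 1 ? atol(argv[1]) : 512;
	long page_size = sysconf(_SC_PAGESIZE);
	int src_node = 0, dst_node = 1;		/* assumed node pair */
	struct timespec t1, t2;
	long i, retcode;

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support\n");
		return 1;
	}

	/* Allocate the test region on the source node and fault it in. */
	char *buf = numa_alloc_onnode(num_pages * page_size, src_node);
	if (!buf) {
		perror("numa_alloc_onnode");
		return 1;
	}
	memset(buf, 1, num_pages * page_size);

	void **pages = calloc(num_pages, sizeof(void *));
	int *nodesArray = calloc(num_pages, sizeof(int));
	int *statusArray = calloc(num_pages, sizeof(int));
	for (i = 0; i < num_pages; i++) {
		pages[i] = buf + i * page_size;
		nodesArray[i] = dst_node;
	}

	/* Same measurement window as in the quoted snippet. */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	retcode = move_pages(getpid(), num_pages, pages, nodesArray,
			     statusArray, MPOL_MF_MOVE);
	clock_gettime(CLOCK_MONOTONIC, &t2);
	if (retcode < 0) {
		perror("move_pages");
		return 1;
	}

	/* tput = num_pages * PAGE_SIZE / (t2 - t1) */
	double secs = (t2.tv_sec - t1.tv_sec) +
		      (t2.tv_nsec - t1.tv_nsec) / 1e9;
	printf("moved %ld pages in %.6f s: %.2f GB/s\n",
	       num_pages, secs, num_pages * page_size / secs / 1e9);

	numa_free(buf, num_pages * page_size);
	return 0;
}

For the THP=Always runs the buffer would additionally need to be 2MB-aligned
so that it can actually be backed by THPs; the sketch only covers the 4KB case.
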
>>>> The results surprise me, since I was able to achieve ~9GB/s when migrating
>>>> 16 2MB THPs with 16 threads on a two socket system with Xeon E5-2650 v3 @ 2.30GHz
>>>> (a 19.2GB/s bandwidth QPI link between two sockets) back in 2019[1].
>>>> These are 10-year-old Haswell CPUs, and your results above show that EPYC 5 can
>>>> only achieve ~4GB/s when migrating 512 2MB THPs with 16 threads. It just does
>>>> not make sense.
>>>>
>>>> One thing you might want to try is to set init_on_alloc=0 in your boot
>>>> parameters to use folio_zero_user() instead of GFP_ZERO to zero pages. That
>>>> might reduce the time spent on page zeroing.
>>>>
>>>> I am also going to rerun the experiments locally on x86_64 boxes to see if your
>>>> results can be replicated.
>>>>
>>>> Thank you for the review and for running these experiments. I really appreciate
>>>> it.
>>>>
>>>> [1] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@xxxxxxxx/
>>>>
>>>
>>> Using init_on_alloc=0 gave a significant performance gain over the last
>>> experiment, but I'm still missing the performance scaling you observed.
>>
>> It might be the difference between x86 and ARM64, but I am not 100% sure.
>> Based on your data below, 2 or 4 threads seem to be the sweet spot for
>> the multi-threaded method on AMD CPUs. BTW, what is the bandwidth between
>> two sockets in your system? From Figure 10 in [1], I see the InfiniBand
>> between two AMD EPYC 7601 @ 2.2GHz was measured at ~12GB/s unidirectional,
>> ~25GB/s bidirectional. I wonder if your results below are cross-socket
>> link bandwidth limited.
>>
>> From my results, the NVIDIA Grace CPU can achieve high copy throughput
>> with more threads between two sockets; maybe part of the reason is that
>> its cross-socket link has a theoretical bandwidth of 900GB/s bidirectional.
>
> I talked to my colleague about this and he mentioned the CCD architecture
> of AMD CPUs. IIUC, one or two cores from one CCD can already saturate
> the CCD's outgoing bandwidth, and all CPUs are enumerated from one CCD to
> another. This means my naive scheduling algorithm, which uses CPUs 0
> through N, uses all cores from one CCD first, then moves to another CCD.
> It is not able to saturate the cross-socket bandwidth. Does that make
> sense to you?
>
> If yes, can you please change my CPU selection code in mm/copy_pages.c:
>
> +	/* TODO: need a better cpu selection method */
> +	for_each_cpu(cpu, per_node_cpumask) {
> +		if (i >= total_mt_num)
> +			break;
> +		cpu_id_list[i] = cpu;
> +		++i;
> +	}
>
> to select CPUs from as many CCDs as possible and rerun the tests?
> That might boost the page migration throughput on AMD CPUs more.
>
> Thanks.
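
To be concrete about what I have in mind, something along the lines of the
sketch below could replace the loop quoted above. It is untested and relies
on the assumption that the CPUs in per_node_cpumask are enumerated one CCD
after another, so striding through the mask spreads the workers across CCDs
instead of packing them onto the first one:

	/*
	 * Untested sketch: pick every stride-th CPU of the node instead of
	 * the first total_mt_num CPUs, assuming CPUs are enumerated CCD by
	 * CCD so that the selected CPUs land on different CCDs.
	 */
	unsigned int nr_node_cpus = cpumask_weight(per_node_cpumask);
	unsigned int stride = max(1U, nr_node_cpus / total_mt_num);
	unsigned int skip = 0;

	i = 0;
	for_each_cpu(cpu, per_node_cpumask) {
		if (i >= total_mt_num)
			break;
		if (skip) {
			skip--;
			continue;
		}
		cpu_id_list[i++] = cpu;
		skip = stride - 1;
	}

A more robust variant would group CPUs by shared L3 (each CCD has its own L3
cache) rather than rely on enumeration order, but the stride version should be
enough to tell whether spreading across CCDs recovers the cross-socket
bandwidth.
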
>
>>>
>>> THP Never
>>> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
>>> 512       1.40     1.43   2.79   3.48   3.63   3.73   3.63   3.57
>>> 4096      2.54     3.32   3.18   4.65   4.83   5.11   5.39   5.78
>>> 8192      3.35     4.40   4.39   4.71   3.63   5.04   5.33   6.00
>>> 16348     3.76     4.50   4.44   5.33   5.41   5.41   6.47   6.41
>>>
>>> THP Always
>>> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
>>> 512       5.21     5.47   5.77   6.92   3.71   2.75   7.54   7.44
>>> 1024      6.10     7.65   8.12   8.41   8.87   8.55   9.13   11.36
>>> 2048      6.39     6.66   9.58   8.92   10.75  12.99  13.33  12.23
>>> 4096      7.33     10.85  8.22   13.57  11.43  10.93  12.53  16.86
>>> 8192      7.26     7.46   8.88   11.82  10.55  10.94  13.27  14.11
>>> 16348     9.07     8.53   11.82  14.89  12.97  13.22  16.14  18.10
>>> 32768     10.45    10.55  11.79  19.19  16.85  17.56  20.58  26.57
>>> 65536     11.00    11.12  13.25  18.27  16.18  16.11  19.61  27.73
>>> 262144    12.37    12.40  15.65  20.00  19.25  19.38  22.60  31.95
>>> 524288    12.44    12.33  15.66  19.78  19.06  18.96  23.31  32.29
>>
>> [1] https://www.dell.com/support/kbdoc/en-us/000143393/amd-epyc-stream-hpl-infiniband-and-wrf-performance-study

BTW, I reran the experiments on a two-socket Xeon E5-2650 v4 @ 2.20GHz system
with the pull method. The 4KB numbers are not very impressive, at most 60% more
throughput, but 2MB can get ~6.5x of the vanilla kernel throughput using 8 or
16 threads.

4KB (GB/s)

| ---- | ------- | ---- | ---- | ---- | ---- | ----- |
|      | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
| ---- | ------- | ---- | ---- | ---- | ---- | ----- |
| 512  | 1.12    | 1.19 | 1.20 | 1.26 | 1.27 | 1.35  |
| 768  | 1.29    | 1.14 | 1.28 | 1.40 | 1.39 | 1.46  |
| 1024 | 1.19    | 1.25 | 1.34 | 1.51 | 1.52 | 1.53  |
| 2048 | 1.14    | 1.12 | 1.44 | 1.61 | 1.73 | 1.71  |
| 4096 | 1.09    | 1.14 | 1.46 | 1.64 | 1.81 | 1.78  |

2MB (GB/s)

| ---- | ------- | ---- | ---- | ----- | ----- | ----- |
|      | vanilla | mt_1 | mt_2 | mt_4  | mt_8  | mt_16 |
| ---- | ------- | ---- | ---- | ----- | ----- | ----- |
| 1    | 2.03    | 2.21 | 2.69 | 2.93  | 3.17  | 3.14  |
| 2    | 2.28    | 2.13 | 3.54 | 4.50  | 4.72  | 4.72  |
| 4    | 2.92    | 2.93 | 4.44 | 6.50  | 7.24  | 7.06  |
| 8    | 2.29    | 2.37 | 3.21 | 6.86  | 8.83  | 8.44  |
| 16   | 2.10    | 2.09 | 4.57 | 8.06  | 8.32  | 9.70  |
| 32   | 2.22    | 2.21 | 4.43 | 8.96  | 9.37  | 11.54 |
| 64   | 2.35    | 2.35 | 3.15 | 7.77  | 10.77 | 13.61 |
| 128  | 2.48    | 2.53 | 5.12 | 8.18  | 11.01 | 15.62 |
| 256  | 2.55    | 2.53 | 5.44 | 8.25  | 12.73 | 16.49 |
| 512  | 2.61    | 2.52 | 5.73 | 11.26 | 17.18 | 16.97 |
| 768  | 2.55    | 2.53 | 5.90 | 11.41 | 14.86 | 17.15 |
| 1024 | 2.56    | 2.52 | 5.99 | 11.46 | 16.77 | 17.25 |

Best Regards,
Yan, Zi