<snip>
>>
>>>> main() {
>>>>     ...
>>>>
>>>>     // code snippet to measure throughput
>>>>     clock_gettime(CLOCK_MONOTONIC, &t1);
>>>>     retcode = move_pages(getpid(), num_pages, pages, nodesArray, statusArray, MPOL_MF_MOVE);
>>>>     clock_gettime(CLOCK_MONOTONIC, &t2);
>>>>
>>>>     // tput = num_pages*PAGE_SIZE/(t2-t1)
>>>>
>>>>     ...
>>>> }
>>>>
>>>>
>>>> Measurements:
>>>> ============
>>>> vanilla: base kernel without patchset
>>>> mt:0 = MT kernel with use_mt_copy=0
>>>> mt:1..mt:32 = MT kernel with use_mt_copy=1 and thread cnt = 1,2,...,32
>>>>
>>>> Measured for both configurations push_0_pull_1=0 and push_0_pull_1=1 and
>>>> for 4KB migration and THP migration.
>>>>
>>>> --------------------
>>>> #1 push_0_pull_1 = 0 (src node CPUs are used)
>>>>
>>>> #1.1 THP=Never, 4KB (GB/s):
>>>> nr_pages   vanilla   mt:0    mt:1    mt:2    mt:4    mt:8    mt:16   mt:32
>>>> 512        1.28      1.28    1.92    1.80    2.24    2.35    2.22    2.17
>>>> 4096       2.40      2.40    2.51    2.58    2.83    2.72    2.99    3.25
>>>> 8192       3.18      2.88    2.83    2.69    3.49    3.46    3.57    3.80
>>>> 16348      3.17      2.94    2.96    3.17    3.63    3.68    4.06    4.15
>>>>
>>>> #1.2 THP=Always, 2MB (GB/s):
>>>> nr_pages   vanilla   mt:0    mt:1    mt:2    mt:4    mt:8    mt:16   mt:32
>>>> 512        4.31      5.02    3.39    3.40    3.33    3.51    3.91    4.03
>>>> 1024       7.13      4.49    3.58    3.56    3.91    3.87    4.39    4.57
>>>> 2048       5.26      6.47    3.91    4.00    3.71    3.85    4.97    6.83
>>>> 4096       9.93      7.77    4.58    3.79    3.93    3.53    6.41    4.77
>>>> 8192       6.47      6.33    4.37    4.67    4.52    4.39    5.30    5.37
>>>> 16348      7.66      8.00    5.20    5.22    5.24    5.28    6.41    7.02
>>>> 32768      8.56      8.62    6.34    6.20    6.20    6.19    7.18    8.10
>>>> 65536      9.41      9.40    7.14    7.15    7.15    7.19    7.96    8.89
>>>> 262144     10.17     10.19   7.26    7.90    7.98    8.05    9.46    10.30
>>>> 524288     10.40     9.95    7.25    7.93    8.02    8.76    9.55    10.30
>>>>
>>>> --------------------
>>>> #2 push_0_pull_1 = 1 (dst node CPUs are used):
>>>>
>>>> #2.1 THP=Never, 4KB (GB/s):
>>>> nr_pages   vanilla   mt:0    mt:1    mt:2    mt:4    mt:8    mt:16   mt:32
>>>> 512        1.28      1.36    2.01    2.74    2.33    2.31    2.53    2.96
>>>> 4096       2.40      2.84    2.94    3.04    3.40    3.23    3.31    4.16
>>>> 8192       3.18      3.27    3.34    3.94    3.77    3.68    4.23    4.76
>>>> 16348      3.17      3.42    3.66    3.21    3.82    4.40    4.76    4.89
>>>>
>>>> #2.2 THP=Always, 2MB (GB/s):
>>>> nr_pages   vanilla   mt:0    mt:1    mt:2    mt:4    mt:8    mt:16   mt:32
>>>> 512        4.31      5.91    4.03    3.73    4.26    4.13    4.78    3.44
>>>> 1024       7.13      6.83    4.60    5.13    5.03    5.19    5.94    7.25
>>>> 2048       5.26      7.09    5.20    5.69    5.83    5.73    6.85    8.13
>>>> 4096       9.93      9.31    4.90    4.82    4.82    5.26    8.46    8.52
>>>> 8192       6.47      7.63    5.66    5.85    5.75    6.14    7.45    8.63
>>>> 16348      7.66      10.00   6.35    6.54    6.66    6.99    8.18    10.21
>>>> 32768      8.56      9.78    7.06    7.41    7.76    9.02    9.55    11.92
>>>> 65536      9.41      10.00   8.19    9.20    9.32    8.68    11.00   13.31
>>>> 262144     10.17     11.17   9.01    9.96    9.99    10.00   11.70   14.27
>>>> 524288     10.40     11.38   9.07    9.98    10.01   10.09   11.95   14.48
>>>>
>>>> Note:
>>>> 1. For THP=Never: I'm using 16X the pages to keep the total size the same
>>>>    as in your experiment with 64KB page size.
>>>> 2. For THP=Always: nr_pages = number of 4KB pages moved
>>>>    (nr_pages=512 => 512 4KB pages => one 2MB page).
>>>>
>>>> I'm seeing little (1.5X in some cases) to no benefit. The performance scaling is
>>>> relatively flat across thread counts.
>>>>
>>>> Is it possible I'm missing something in my testing?
>>>>
>>>> Could the base page size difference (4KB vs 64KB) be playing a role in
>>>> the scaling behavior? How does the performance vary with 4KB pages on your system?
>>>>
>>>> I'd be happy to work with you on investigating these differences.
>>>> Let me know if you'd like any additional test data or if there are specific
>>>> configurations I should try.
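(For anyone who wants to reproduce this measurement: a minimal standalone
version of the loop above could look like the sketch below. This is my
illustration, not the original test program; the page count, source/destination
node numbers, and the libnuma-based allocation are placeholders.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <numa.h>
#include <numaif.h>

/* Build with: gcc move_pages_tput.c -lnuma (filename is illustrative) */
int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned long num_pages = 4096;		/* placeholder page count */
	int src_node = 0, dst_node = 1;		/* placeholder NUMA nodes */
	struct timespec t1, t2;

	if (numa_available() < 0)
		return 1;

	/* Allocate on the source node and fault the pages in. */
	char *buf = numa_alloc_onnode(num_pages * page_size, src_node);
	if (!buf)
		return 1;
	memset(buf, 1, num_pages * page_size);

	void **pages = malloc(num_pages * sizeof(void *));
	int *nodes = malloc(num_pages * sizeof(int));
	int *status = malloc(num_pages * sizeof(int));
	for (unsigned long i = 0; i < num_pages; i++) {
		pages[i] = buf + i * page_size;
		nodes[i] = dst_node;
	}

	/* Time only the migration itself. */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	long ret = move_pages(getpid(), num_pages, pages, nodes, status,
			      MPOL_MF_MOVE);
	clock_gettime(CLOCK_MONOTONIC, &t2);
	if (ret < 0)
		perror("move_pages");

	double secs = (t2.tv_sec - t1.tv_sec) +
		      (t2.tv_nsec - t1.tv_nsec) / 1e9;
	printf("throughput: %.2f GB/s\n", num_pages * page_size / secs / 1e9);

	numa_free(buf, num_pages * page_size);
	return 0;
}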
>>> The results surprise me, since I was able to achieve ~9GB/s when migrating
>>> 16 2MB THPs with 16 threads on a two-socket system with Xeon E5-2650 v3 @ 2.30GHz
>>> (a 19.2GB/s bandwidth QPI link between two sockets) back in 2019[1].
>>> These are 10-year-old Haswell CPUs. And your results above show that EPYC 5 can
>>> only achieve ~4GB/s when migrating 512 2MB THPs with 16 threads. It just does
>>> not make sense.
>>>
>>> One thing you might want to try is to set init_on_alloc=0 in your boot
>>> parameters to use folio_zero_user() instead of GFP_ZERO to zero pages. That
>>> might reduce the time spent on page zeroing.
>>>
>>> I am also going to rerun the experiments locally on x86_64 boxes to see if your
>>> results can be replicated.
>>>
>>> Thank you for the review and for running these experiments. I really appreciate
>>> it.
>>>
>>> [1] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@xxxxxxxx/
>>>
>>
>> Using init_on_alloc=0 gave a significant performance gain over the last experiment,
>> but I'm still missing the performance scaling you observed.
>
> It might be the difference between x86 and ARM64, but I am not 100% sure.
> Based on your data below, 2 or 4 threads seem to be the sweet spot for
> the multi-threaded method on AMD CPUs. BTW, what is the bandwidth between
> two sockets in your system? From Figure 10 in [1], I see the InfiniBand
> between two AMD EPYC 7601 @ 2.2GHz was measured at ~12GB/s unidirectional,
> ~25GB/s bidirectional. I wonder if your results below are cross-socket
> link bandwidth limited.
>
> From my results, the NVIDIA Grace CPU can achieve high copy throughput
> with more threads between two sockets; maybe part of the reason is that
> its cross-socket link theoretical bandwidth is 900GB/s bidirectional.

I talked to my colleague about this and he mentioned the CCD architecture
on AMD CPUs. IIUC, one or two cores from one CCD can already saturate the
CCD's outgoing bandwidth, and all CPUs are enumerated from one CCD to another.
This means my naive scheduling algorithm, which uses CPUs 0 through N, uses
all cores from one CCD first, then moves to another CCD. It is not able to
saturate the cross-socket bandwidth. Does it make sense to you?

If yes, can you please change my CPU selection code in mm/copy_pages.c:

+	/* TODO: need a better cpu selection method */
+	for_each_cpu(cpu, per_node_cpumask) {
+		if (i >= total_mt_num)
+			break;
+		cpu_id_list[i] = cpu;
+		++i;
+	}

to select CPUs from as many CCDs as possible and rerun the tests. That might
boost the page migration throughput on AMD CPUs more. Thanks.
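As a starting point, something like the following might work (untested sketch;
it assumes CPUs within a CCD are enumerated contiguously in the node's cpumask,
so striding through the mask should place consecutive workers on different
CCDs; the stride heuristic is only illustrative, and a proper fix would consult
the real topology, e.g. the last-level-cache sharing masks, instead):

	/*
	 * Spread the workers evenly across the node's CPUs instead of
	 * taking the first total_mt_num ones, so that, with CPUs
	 * enumerated CCD by CCD, consecutive workers land on different CCDs.
	 */
	int nr_node_cpus = cpumask_weight(per_node_cpumask);
	int stride = max(nr_node_cpus / total_mt_num, 1);
	int j;

	cpu = cpumask_first(per_node_cpumask);
	for (i = 0; i < total_mt_num; i++) {
		cpu_id_list[i] = cpu;
		/* hop 'stride' CPUs ahead, wrapping within the node's mask */
		for (j = 0; j < stride; j++) {
			cpu = cpumask_next(cpu, per_node_cpumask);
			if (cpu >= nr_cpu_ids)
				cpu = cpumask_first(per_node_cpumask);
		}
	}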
>>
>> THP Never
>> nr_pages   vanilla   mt:0    mt:1    mt:2    mt:4    mt:8    mt:16   mt:32
>> 512        1.40      1.43    2.79    3.48    3.63    3.73    3.63    3.57
>> 4096       2.54      3.32    3.18    4.65    4.83    5.11    5.39    5.78
>> 8192       3.35      4.40    4.39    4.71    3.63    5.04    5.33    6.00
>> 16348      3.76      4.50    4.44    5.33    5.41    5.41    6.47    6.41
>>
>> THP Always
>> nr_pages   vanilla   mt:0    mt:1    mt:2    mt:4    mt:8    mt:16   mt:32
>> 512        5.21      5.47    5.77    6.92    3.71    2.75    7.54    7.44
>> 1024       6.10      7.65    8.12    8.41    8.87    8.55    9.13    11.36
>> 2048       6.39      6.66    9.58    8.92    10.75   12.99   13.33   12.23
>> 4096       7.33      10.85   8.22    13.57   11.43   10.93   12.53   16.86
>> 8192       7.26      7.46    8.88    11.82   10.55   10.94   13.27   14.11
>> 16348      9.07      8.53    11.82   14.89   12.97   13.22   16.14   18.10
>> 32768      10.45     10.55   11.79   19.19   16.85   17.56   20.58   26.57
>> 65536      11.00     11.12   13.25   18.27   16.18   16.11   19.61   27.73
>> 262144     12.37     12.40   15.65   20.00   19.25   19.38   22.60   31.95
>> 524288     12.44     12.33   15.66   19.78   19.06   18.96   23.31   32.29
>
> [1] https://www.dell.com/support/kbdoc/en-us/000143393/amd-epyc-stream-hpl-infiniband-and-wrf-performance-study


Best Regards,
Yan, Zi