On 1/16/2025 10:27 AM, Shivank Garg wrote:
> On 1/11/2025 1:21 AM, Zi Yan wrote:
> <snip>
>
>
>>>> BTW, I notice that you called dmaengine_get_dma_device() in folios_copy_dma(),
>>>> which would incur a huge overhead, based on my past experience using a DMA engine
>>>> for page copy. I know it is needed to make sure DMA is still present, but
>>>> its cost needs to be minimized to make DMA folio copy usable. Otherwise,
>>>> the 768MB/s DMA copy throughput from your cover letter cannot convince people
>>>> to use it for page migration, since a single CPU can achieve more than that,
>>>> as you showed in the table below.
>
> Thank you for pointing this out.
> I'm learning about the DMAEngine and will look more into the DMA driver part.
>
>>>>> Using init_on_alloc=0 gave a significant performance gain over the last experiment,
>>>>> but I'm still missing the performance scaling you observed.
>>>>
>>>> It might be the difference between x86 and ARM64, but I am not 100% sure.
>>>> Based on your data below, 2 or 4 threads seem to be the sweet spot for
>>>> the multi-threaded method on AMD CPUs. BTW, what is the bandwidth between
>>>> two sockets in your system? From Figure 10 in [1], I see the InfiniBand
>>>> between two AMD EPYC 7601 @ 2.2GHz was measured at ~12GB/s unidirectional,
>>>> ~25GB/s bidirectional. I wonder if your results below are cross-socket
>>>> link bandwidth limited.
>
> I tested the cross-socket bandwidth on my EPYC Zen 5 system and easily get over
> 10x that bandwidth. I don't think bandwidth is an issue here.
>
>
>>>>
>>>> From my results, the NVIDIA Grace CPU can achieve high copy throughput
>>>> with more threads between two sockets; maybe part of the reason is that
>>>> its cross-socket link theoretical bandwidth is 900GB/s bidirectional.
>>>
>>> I talked to my colleague about this and he mentioned the CCD architecture
>>> on AMD CPUs. IIUC, one or two cores from one CCD can already saturate
>>> the CCD's outgoing bandwidth, and all CPUs are enumerated from one CCD to
>>> another. This means my naive scheduling algorithm, which assigns CPUs 0 to N
>>> for N threads, uses all cores from one CCD first, then moves to another
>>> CCD. It is not able to saturate the cross-socket bandwidth. Does it make
>>> sense to you?
>>>
>>> If yes, can you please change my CPU selection code in mm/copy_pages.c:
>
> This makes sense.
>
> I first tried distributing the worker threads across different CCDs, which yielded
> better results.
>
> Also, I switched my system to the NPS-2 config (2 nodes per socket). This was done
> to eliminate cross-socket connections and variables by focusing on intra-socket
> page migrations.
>
> Cross-Socket (Node 0 -> Node 2)
> THP Always (2MB pages), GB/s
>
> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
> 262144    12.37    12.52  15.72  24.94  30.40  33.23  34.68  29.67
> 524288    12.44    12.19  15.70  24.96  32.72  33.40  35.40  29.18
>
> Intra-Socket (Node 0 -> Node 1)
> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
> 262144    12.37    17.10  18.65  26.05  35.56  37.80  33.73  29.29
> 524288    12.44    16.73  18.87  24.34  35.63  37.49  33.79  29.76
>
> I have temporarily hardcoded the CPU assignments and will work on improving the
> CPU selection code.
>>>
>>> +	/* TODO: need a better cpu selection method */
>>> +	for_each_cpu(cpu, per_node_cpumask) {
>>> +		if (i >= total_mt_num)
>>> +			break;
>>> +		cpu_id_list[i] = cpu;
>>> +		++i;
>>> +	}
>>>
>>> to select CPUs from as many CCDs as possible and rerun the tests.
>>> That might boost the page migration throughput on AMD CPUs more.
>>>
>>> Thanks.
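
For illustration, distributing the workers across CCDs could look roughly like the
sketch below. This is only a rough, untested sketch, not the code behind the numbers
above: it assumes the CPUs in per_node_cpumask are enumerated CCD by CCD (as you
described), and it takes a cpus_per_ccd value as a parameter, whereas a real version
would derive that from the cache/LLC topology.

#include <linux/cpumask.h>

/*
 * Illustrative only: pick worker CPUs by striding through the node's
 * cpumask so that consecutive workers land on different CCDs, instead
 * of taking the first total_mt_num CPUs.
 */
static int spread_cpus_across_ccds(const struct cpumask *per_node_cpumask,
				   int cpu_id_list[], int total_mt_num,
				   unsigned int cpus_per_ccd)
{
	unsigned int offset, i = 0;
	int cpu;

	for (offset = 0; offset < cpus_per_ccd && i < total_mt_num; offset++) {
		unsigned int idx = 0;

		/* pass 0 takes CPU 0 of each CCD, pass 1 takes CPU 1, ... */
		for_each_cpu(cpu, per_node_cpumask) {
			if (i >= total_mt_num)
				break;
			if (idx++ % cpus_per_ccd == offset)
				cpu_id_list[i++] = cpu;
		}
	}

	return i;	/* number of CPUs actually selected */
}
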
>>>
>>>>>
>>>>> THP Never (GB/s)
>>>>> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
>>>>> 512       1.40     1.43   2.79   3.48   3.63   3.73   3.63   3.57
>>>>> 4096      2.54     3.32   3.18   4.65   4.83   5.11   5.39   5.78
>>>>> 8192      3.35     4.40   4.39   4.71   3.63   5.04   5.33   6.00
>>>>> 16348     3.76     4.50   4.44   5.33   5.41   5.41   6.47   6.41
>>>>>
>>>>> THP Always (GB/s)
>>>>> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
>>>>> 512       5.21     5.47   5.77   6.92   3.71   2.75   7.54   7.44
>>>>> 1024      6.10     7.65   8.12   8.41   8.87   8.55   9.13   11.36
>>>>> 2048      6.39     6.66   9.58   8.92   10.75  12.99  13.33  12.23
>>>>> 4096      7.33     10.85  8.22   13.57  11.43  10.93  12.53  16.86
>>>>> 8192      7.26     7.46   8.88   11.82  10.55  10.94  13.27  14.11
>>>>> 16348     9.07     8.53   11.82  14.89  12.97  13.22  16.14  18.10
>>>>> 32768     10.45    10.55  11.79  19.19  16.85  17.56  20.58  26.57
>>>>> 65536     11.00    11.12  13.25  18.27  16.18  16.11  19.61  27.73
>>>>> 262144    12.37    12.40  15.65  20.00  19.25  19.38  22.60  31.95
>>>>> 524288    12.44    12.33  15.66  19.78  19.06  18.96  23.31  32.29
>>>>
>>>> [1] https://www.dell.com/support/kbdoc/en-us/000143393/amd-epyc-stream-hpl-infiniband-and-wrf-performance-study
>>
>>
>> BTW, I reran the experiments on a two-socket Xeon E5-2650 v4 @ 2.20GHz system with the pull method.
>> The 4KB case is not very impressive, at most 60% more throughput, but 2MB can get ~6.5x the
>> vanilla kernel throughput using 8 or 16 threads.
>>
>>
>> 4KB (GB/s)
>>
>> | -------- | ------- | ---- | ---- | ---- | ---- | ----- |
>> | nr_pages | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
>> | -------- | ------- | ---- | ---- | ---- | ---- | ----- |
>> | 512      | 1.12    | 1.19 | 1.20 | 1.26 | 1.27 | 1.35  |
>> | 768      | 1.29    | 1.14 | 1.28 | 1.40 | 1.39 | 1.46  |
>> | 1024     | 1.19    | 1.25 | 1.34 | 1.51 | 1.52 | 1.53  |
>> | 2048     | 1.14    | 1.12 | 1.44 | 1.61 | 1.73 | 1.71  |
>> | 4096     | 1.09    | 1.14 | 1.46 | 1.64 | 1.81 | 1.78  |
>>
>>
>>
>> 2MB (GB/s)
>> | -------- | ------- | ---- | ---- | ----- | ----- | ----- |
>> | nr_pages | vanilla | mt_1 | mt_2 | mt_4  | mt_8  | mt_16 |
>> | -------- | ------- | ---- | ---- | ----- | ----- | ----- |
>> | 1        | 2.03    | 2.21 | 2.69 | 2.93  | 3.17  | 3.14  |
>> | 2        | 2.28    | 2.13 | 3.54 | 4.50  | 4.72  | 4.72  |
>> | 4        | 2.92    | 2.93 | 4.44 | 6.50  | 7.24  | 7.06  |
>> | 8        | 2.29    | 2.37 | 3.21 | 6.86  | 8.83  | 8.44  |
>> | 16       | 2.10    | 2.09 | 4.57 | 8.06  | 8.32  | 9.70  |
>> | 32       | 2.22    | 2.21 | 4.43 | 8.96  | 9.37  | 11.54 |
>> | 64       | 2.35    | 2.35 | 3.15 | 7.77  | 10.77 | 13.61 |
>> | 128      | 2.48    | 2.53 | 5.12 | 8.18  | 11.01 | 15.62 |
>> | 256      | 2.55    | 2.53 | 5.44 | 8.25  | 12.73 | 16.49 |
>> | 512      | 2.61    | 2.52 | 5.73 | 11.26 | 17.18 | 16.97 |
>> | 768      | 2.55    | 2.53 | 5.90 | 11.41 | 14.86 | 17.15 |
>> | 1024     | 2.56    | 2.52 | 5.99 | 11.46 | 16.77 | 17.25 |
>>
>
> I see, thank you for checking.
>
> Meanwhile, I'll continue to explore performance optimization
> avenues.

Hi Zi,

I experimented with your testcase[1] and got a 2-2.5X throughput gain over my
previous experiment. The multi-threading scaling for 32 threads is ~4X (slightly
higher than in my previous experiment).

The main difference between our move_pages() throughput benchmarks is that you
allocate THPs explicitly with aligned_alloc() and MADV_HUGEPAGE, whereas I was
relying on system THP and operating on 4KB boundaries. Although both methods use
THP and I expected similar performance, the throughput was lower in my case
because, with my test code, the kernel processes all 512 4KB pages within a 2MB
region in the first migration attempt; for the subsequent pages,
__add_folio_for_migration() returns early with folio_nid(folio) == node, as they
are already on the target node. This adds the extra overhead of vma_lookup() and
folio_walk_start() for every such page in my experiment.
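
To make the difference concrete, here is a minimal user-space sketch. It is not
your testcase or my exact benchmark: the buffer size, the target node, and the
missing error handling are simplifications for illustration only.

#define _GNU_SOURCE
#include <numaif.h>	/* move_pages(), MPOL_MF_MOVE; link with -lnuma */
#include <stdlib.h>
#include <sys/mman.h>

#define PAGE_SZ	4096UL
#define THP_SZ	(2UL * 1024 * 1024)
#define NR_THP	64	/* arbitrary buffer size for the sketch */

int main(void)
{
	/* Explicit THP, as in the testcase[1]: 2MB-aligned buffer + MADV_HUGEPAGE */
	char *buf = aligned_alloc(THP_SZ, NR_THP * THP_SZ);
	void *pages[NR_THP];
	int nodes[NR_THP], status[NR_THP];

	madvise(buf, NR_THP * THP_SZ, MADV_HUGEPAGE);
	for (unsigned long off = 0; off < NR_THP * THP_SZ; off += PAGE_SZ)
		buf[off] = 1;	/* fault in, hopefully as 2MB folios */

	/* One move_pages() entry per 2MB folio: each entry does real migration work */
	for (unsigned long i = 0; i < NR_THP; i++) {
		pages[i] = buf + i * THP_SZ;
		nodes[i] = 1;	/* illustrative target node */
	}
	move_pages(0, NR_THP, pages, nodes, status, MPOL_MF_MOVE);

	/*
	 * My original test passed one entry per 4KB page instead. The first
	 * entry of each 2MB region migrates the whole large folio; the other
	 * 511 entries only pay the vma_lookup()/folio_walk_start() cost and
	 * return early because the folio is already on the target node.
	 */
	return 0;
}
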
2MB pages (GB/s):

nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
1         10.74    11.04  4.68   8.17   6.47   6.09   3.97   6.20
2         12.44    4.90   11.19  14.10  15.33  8.45   10.09  9.97
4         14.82    9.80   11.93  18.35  21.82  17.09  10.53  7.51
8         16.13    9.91   15.26  11.85  26.53  13.09  12.71  13.75
16        15.99    8.81   13.84  22.43  33.89  11.91  12.30  13.26
32        14.03    11.37  17.54  23.96  57.07  18.78  19.51  21.29
64        15.79    9.55   22.19  33.17  57.18  65.51  55.39  62.53
128       18.22    16.65  21.49  30.73  52.99  61.05  58.44  60.38
256       19.78    20.56  24.72  34.94  56.73  71.11  61.83  62.77
512       20.27    21.40  27.47  39.23  65.72  67.97  70.48  71.39
1024      20.48    21.48  27.48  38.30  68.62  77.94  78.00  78.95

[1]: https://github.com/x-y-z/thp-migration-bench/blob/arm64/move_thp.c

>
> Best Regards,
> Shivank
>>
>> Best Regards,
>> Yan, Zi
>>
>