On 1/11/2025 1:21 AM, Zi Yan wrote:

<snip>

>>> BTW, I notice that you called dmaengine_get_dma_device() in folios_copy_dma(),
>>> which would incur a huge overhead, based on my past experience using DMA engine
>>> for page copy. I know it is needed to make sure DMA is still present, but
>>> its cost needs to be minimized to make DMA folio copy usable. Otherwise,
>>> the 768MB/s DMA copy throughput from your cover letter cannot convince people
>>> to use it for page migration, since single CPU can achieve more than that,
>>> as you showed in the table below.

Thank you for pointing this out. I'm learning about the DMAEngine and will
look more into the DMA driver part.

>>>> Using init_on_alloc=0 gave significant performance gain over the last experiment
>>>> but I'm still missing the performance scaling you observed.
>>>
>>> It might be the difference between x86 and ARM64, but I am not 100% sure.
>>> Based on your data below, 2 or 4 threads seem to the sweep spot for
>>> the multi-threaded method on AMD CPUs. BTW, what is the bandwidth between
>>> two sockets in your system? From Figure 10 in [1], I see the InfiniteBand
>>> between two AMD EPYC 7601 @ 2.2GHz was measured at ~12GB/s unidirectional,
>>> ~25GB/s bidirectional. I wonder if your results below are cross-socket
>>> link bandwidth limited.

I tested the cross-socket bandwidth on my EPYC Zen 5 system and easily get
more than 10X that bandwidth. I don't think bandwidth is an issue here.

>>>
>>> From my results, NVIDIA Grace CPU can achieve high copy throughput
>>> with more threads between two sockets, maybe part of the reason is that
>>> its cross-socket link theoretical bandwidth is 900GB/s bidirectional.
>>
>> I talked to my colleague about this and he mentioned about CCD architecture
>> on AMD CPUs. IIUC, one or two cores from one CCD can already saturate
>> the CCD's outgoing bandwidth and all CPUs are enumerated from one CCD to
>> another.
>> This means my naive scheduling algorithm, which use CPUs from
>> 0 to N threads, uses all cores from one CDD first, then move to another
>> CCD. It is not able to saturate the cross-socket bandwidth. Does it make
>> sense to you?
>>
>> If yes, can you please change the my cpu selection code in mm/copy_pages.c:

This makes sense. I first tried distributing the worker threads across
different CCDs, which yielded better results.

I also switched my system to the NPS-2 config (2 NUMA nodes per socket) to
eliminate cross-socket links as a variable and focus on intra-socket page
migrations.

Throughput (GB/s), THP Always (2MB pages):

Cross-Socket (Node 0 -> Node 2)
nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
262144    12.37    12.52  15.72  24.94  30.40  33.23  34.68  29.67
524288    12.44    12.19  15.70  24.96  32.72  33.40  35.40  29.18

Intra-Socket (Node 0 -> Node 1)
nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
262144    12.37    17.10  18.65  26.05  35.56  37.80  33.73  29.29
524288    12.44    16.73  18.87  24.34  35.63  37.49  33.79  29.76

I have temporarily hardcoded the CPU assignments and will work on improving
the CPU selection code.

>>
>> +	/* TODO: need a better cpu selection method */
>> +	for_each_cpu(cpu, per_node_cpumask) {
>> +		if (i >= total_mt_num)
>> +			break;
>> +		cpu_id_list[i] = cpu;
>> +		++i;
>> +	}
>>
>> to select CPUs from as many CCDs as possible and rerun the tests.
>> That might boost the page migration throughput on AMD CPUs more.
>>
>> Thanks.
>>
>>>>
>>>> THP Never
>>>> nr_pages  vanilla  mt:0  mt:1  mt:2  mt:4  mt:8  mt:16  mt:32
>>>> 512       1.40     1.43  2.79  3.48  3.63  3.73  3.63   3.57
>>>> 4096      2.54     3.32  3.18  4.65  4.83  5.11  5.39   5.78
>>>> 8192      3.35     4.40  4.39  4.71  3.63  5.04  5.33   6.00
>>>> 16348     3.76     4.50  4.44  5.33  5.41  5.41  6.47   6.41
>>>>
>>>> THP Always
>>>> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
>>>> 512       5.21     5.47   5.77   6.92   3.71   2.75   7.54   7.44
>>>> 1024      6.10     7.65   8.12   8.41   8.87   8.55   9.13   11.36
>>>> 2048      6.39     6.66   9.58   8.92   10.75  12.99  13.33  12.23
>>>> 4096      7.33     10.85  8.22   13.57  11.43  10.93  12.53  16.86
>>>> 8192      7.26     7.46   8.88   11.82  10.55  10.94  13.27  14.11
>>>> 16348     9.07     8.53   11.82  14.89  12.97  13.22  16.14  18.10
>>>> 32768     10.45    10.55  11.79  19.19  16.85  17.56  20.58  26.57
>>>> 65536     11.00    11.12  13.25  18.27  16.18  16.11  19.61  27.73
>>>> 262144    12.37    12.40  15.65  20.00  19.25  19.38  22.60  31.95
>>>> 524288    12.44    12.33  15.66  19.78  19.06  18.96  23.31  32.29
>>>
>>> [1] https://www.dell.com/support/kbdoc/en-us/000143393/amd-epyc-stream-hpl-infiniband-and-wrf-performance-study
>
>
> BTW, I rerun the experiments on a two socket Xeon E5-2650 v4 @ 2.20GHz system with pull method.
> The 4KB is not very impressive, at most 60% more throughput, but 2MB can get ~6.5x of
> vanilla kernel throughput using 8 or 16 threads.
>
>
> 4KB (GB/s)
>
> | ---- | ------- | ---- | ---- | ---- | ---- | ----- |
> |      | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
> | ---- | ------- | ---- | ---- | ---- | ---- | ----- |
> | 512  | 1.12    | 1.19 | 1.20 | 1.26 | 1.27 | 1.35  |
> | 768  | 1.29    | 1.14 | 1.28 | 1.40 | 1.39 | 1.46  |
> | 1024 | 1.19    | 1.25 | 1.34 | 1.51 | 1.52 | 1.53  |
> | 2048 | 1.14    | 1.12 | 1.44 | 1.61 | 1.73 | 1.71  |
> | 4096 | 1.09    | 1.14 | 1.46 | 1.64 | 1.81 | 1.78  |
>
>
>
> 2MB (GB/s)
> | ---- | ------- | ---- | ---- | ----- | ----- | ----- |
> |      | vanilla | mt_1 | mt_2 | mt_4  | mt_8  | mt_16 |
> | ---- | ------- | ---- | ---- | ----- | ----- | ----- |
> | 1    | 2.03    | 2.21 | 2.69 | 2.93  | 3.17  | 3.14  |
> | 2    | 2.28    | 2.13 | 3.54 | 4.50  | 4.72  | 4.72  |
> | 4    | 2.92    | 2.93 | 4.44 | 6.50  | 7.24  | 7.06  |
> | 8    | 2.29    | 2.37 | 3.21 | 6.86  | 8.83  | 8.44  |
> | 16   | 2.10    | 2.09 | 4.57 | 8.06  | 8.32  | 9.70  |
> | 32   | 2.22    | 2.21 | 4.43 | 8.96  | 9.37  | 11.54 |
> | 64   | 2.35    | 2.35 | 3.15 | 7.77  | 10.77 | 13.61 |
> | 128  | 2.48    | 2.53 | 5.12 | 8.18  | 11.01 | 15.62 |
> | 256  | 2.55    | 2.53 | 5.44 | 8.25  | 12.73 | 16.49 |
> | 512  | 2.61    | 2.52 | 5.73 | 11.26 | 17.18 | 16.97 |
> | 768  | 2.55    | 2.53 | 5.90 | 11.41 | 14.86 | 17.15 |
> | 1024 | 2.56    | 2.52 | 5.99 | 11.46 | 16.77 | 17.25 |
>

I see, thank you for checking. Meanwhile, I'll continue to explore
performance optimization avenues.

Best Regards,
Shivank

>
>
> Best Regards,
> Yan, Zi
>