Hi David, Zi,

On 1/27/2025 6:07 PM, Zi Yan wrote:
> On 27 Jan 2025, at 1:55, David Rientjes wrote:
>
>> On Thu, 23 Jan 2025, Shivank Garg wrote:
>>
>>> Hi all,
>>>
>>> Zi Yan and I would like to propose the topic: Enhancements to Page
>>> Migration with Multi-threading and Batch Offloading to DMA.
>>>
>>
>> I think this would be a very useful topic to discuss, thanks for proposing
>> it.

Thanks for your interest in our proposal.

>>
>>> Page migration is a critical operation in NUMA systems that can incur
>>> significant overheads, affecting memory management performance across
>>> various workloads. For example, copying folios between DRAM NUMA nodes
>>> can take ~25% of the total migration cost for migrating 256MB of data.
>>>
>>> Modern systems are equipped with powerful DMA engines for bulk data
>>> copying, GPUs, and high CPU core counts. Leveraging these hardware
>>> capabilities becomes essential for systems where frequent page promotion
>>> and demotion occur - from large-scale tiered-memory systems with CXL nodes
>>> to CPU-GPU coherent systems with GPU memory exposed as NUMA nodes.
>>>
>>
>> Indeed, there are multiple use cases for optimizations in this area. With
>> the ramp of memory tiered systems, I think there will be an even greater
>> reliance on memory migration going forward.
>>
>> Do you have numbers to share on how offloading, even as a proof of
>> concept, moves the needle compared to traditional and sequential memory
>> migration?
>
> For multithreaded page migration, you can see my RFC patchset [1]:
>
> On NVIDIA Grace:
>
> The 32-thread copy throughput can be up to 10x that of single-threaded
> serial folio copy. Batching folio copies benefits not only huge pages but
> also base pages.
>
> 64KB (GB/s):
>
> nr_pages  vanilla  mt_1  mt_2   mt_4   mt_8   mt_16  mt_32
> 32        5.43     4.90  5.65   7.31   7.60   8.61   6.43
> 256       6.95     6.89  9.28   14.67  22.41  23.39  23.93
> 512       7.88     7.26  10.15  17.53  27.82  27.88  33.93
> 768       7.65     7.42  10.46  18.59  28.65  29.67  30.76
> 1024      7.46     8.01  10.90  17.77  27.04  32.18  38.80
>
> 2MB mTHP (GB/s):
>
> nr_pages  vanilla  mt_1  mt_2   mt_4   mt_8   mt_16  mt_32
> 1         5.94     2.90  6.90   8.56   11.16  8.76   6.41
> 2         7.67     5.57  7.11   12.48  17.37  15.68  14.10
> 4         8.01     6.04  10.25  20.14  22.52  27.79  25.28
> 8         8.42     7.00  11.41  24.73  33.96  32.62  39.55
> 16        9.41     6.91  12.23  27.51  43.95  49.15  51.38
> 32        10.23    7.15  13.03  29.52  49.49  69.98  71.51
> 64        9.40     7.37  13.88  30.38  52.00  76.89  79.41
> 128       8.59     7.23  14.20  28.39  49.98  78.27  90.18
> 256       8.43     7.16  14.59  28.14  48.78  76.88  92.28
> 512       8.31     7.78  14.40  26.20  43.31  63.91  75.21
> 768       8.30     7.86  14.83  27.41  46.25  69.85  81.31
> 1024      8.31     7.90  14.96  27.62  46.75  71.76  83.84
>
> I also ran it on a two-socket Xeon E5-2650 v4:
>
> 4KB (GB/s)
>
> | nr_pages | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
> | -------- | ------- | ---- | ---- | ---- | ---- | ----- |
> | 512      | 1.12    | 1.19 | 1.20 | 1.26 | 1.27 | 1.35  |
> | 768      | 1.29    | 1.14 | 1.28 | 1.40 | 1.39 | 1.46  |
> | 1024     | 1.19    | 1.25 | 1.34 | 1.51 | 1.52 | 1.53  |
> | 2048     | 1.14    | 1.12 | 1.44 | 1.61 | 1.73 | 1.71  |
> | 4096     | 1.09    | 1.14 | 1.46 | 1.64 | 1.81 | 1.78  |
>
> 2MB (GB/s)
>
> | nr_pages | vanilla | mt_1 | mt_2 | mt_4  | mt_8  | mt_16 |
> | -------- | ------- | ---- | ---- | ----- | ----- | ----- |
> | 1        | 2.03    | 2.21 | 2.69 | 2.93  | 3.17  | 3.14  |
> | 2        | 2.28    | 2.13 | 3.54 | 4.50  | 4.72  | 4.72  |
> | 4        | 2.92    | 2.93 | 4.44 | 6.50  | 7.24  | 7.06  |
> | 8        | 2.29    | 2.37 | 3.21 | 6.86  | 8.83  | 8.44  |
> | 16       | 2.10    | 2.09 | 4.57 | 8.06  | 8.32  | 9.70  |
> | 32       | 2.22    | 2.21 | 4.43 | 8.96  | 9.37  | 11.54 |
> | 64       | 2.35    | 2.35 | 3.15 | 7.77  | 10.77 | 13.61 |
> | 128      | 2.48    | 2.53 | 5.12 | 8.18  | 11.01 | 15.62 |
> | 256      | 2.55    | 2.53 | 5.44 | 8.25  | 12.73 | 16.49 |
> | 512      | 2.61    | 2.52 | 5.73 | 11.26 | 17.18 | 16.97 |
> | 768      | 2.55    | 2.53 | 5.90 | 11.41 | 14.86 | 17.15 |
> | 1024     | 2.56    | 2.52 | 5.99 | 11.46 | 16.77 | 17.25 |
>
> Shivank ran it on AMD EPYC Zen 5, after some tuning (spreading threads
> across different CCDs):
>
> 2MB pages (GB/s):
>
> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
> 1         10.74    11.04  4.68   8.17   6.47   6.09   3.97   6.20
> 2         12.44    4.90   11.19  14.10  15.33  8.45   10.09  9.97
> 4         14.82    9.80   11.93  18.35  21.82  17.09  10.53  7.51
> 8         16.13    9.91   15.26  11.85  26.53  13.09  12.71  13.75
> 16        15.99    8.81   13.84  22.43  33.89  11.91  12.30  13.26
> 32        14.03    11.37  17.54  23.96  57.07  18.78  19.51  21.29
> 64        15.79    9.55   22.19  33.17  57.18  65.51  55.39  62.53
> 128       18.22    16.65  21.49  30.73  52.99  61.05  58.44  60.38
> 256       19.78    20.56  24.72  34.94  56.73  71.11  61.83  62.77
> 512       20.27    21.40  27.47  39.23  65.72  67.97  70.48  71.39
> 1024      20.48    21.48  27.48  38.30  68.62  77.94  78.00  78.95
>
>>
>>> Existing page migration performs sequential page copying, underutilizing
>>> modern CPU architectures and high-bandwidth memory subsystems.
>>>
>>> We have proposed and posted RFCs to enhance page migration through three
>>> key techniques:
>>> 1. Batching migration operations for bulk copying data [1]
>>> 2. Multi-threaded folio copying [2]
>>> 3. DMA offloading to hardware accelerators [1]
>>>
>>
>> Curious: does memory migration of pages that are actively undergoing DMA
>> with hardware assist fit into any of these?
>
> It should be similar to 3, but in this case, DMA is used to copy pages
> between NUMA nodes, whereas traditional DMA page migration is used to copy
> pages between host and devices.
>

I'm planning to test using SDXi as the DMA engine for offload, and it
doesn't support migrating pages that are actively undergoing DMA, AFAIU.

>>
>>> By employing batching and multi-threaded folio copying, we are able to
>>> achieve significant improvements in page migration throughput for large
>>> pages.
>>>
>>> Discussion points:
>>> 1. Performance:
>>> a. Policy decision for DMA and CPU selection
>>> b. Platform-specific scheduling of folio-copy worker threads for better
>>> bandwidth utilization
>>
>> Why platform specific? I *assume* this means a generic framework that can
>> optimize for scheduling based on the underlying hardware and not specific
>> implementations that can only be used on AMD, for example. Is that the
>> case?
>
> I think the framework will be generic, but the CPU scheduling (which cores
> to choose for page copying) will differ from vendor to vendor.
>
> Due to existing CPU structure, like chiplet design, a single CPU scheduling
> algorithm does not fit CPUs from different vendors. For example, on
> NVIDIA Grace, you can use any CPUs to copy pages and always achieve high
> page copy throughput, but on AMD CPUs with multiple CCDs, spreading copy
> threads across different CCDs can achieve much higher page copy throughput
> than putting all threads in a single CCD. I assume Intel CPUs with chiplet
> design would see the same result.

Thank you Zi for helping with results and queries.
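
To make point 1.b a bit more concrete ahead of the session, below is a rough
sketch of the kind of chunked, multi-threaded folio copy we have in mind.
This is not the code from the posted RFCs: the helper and struct names are
made up, the fan-out uses plain unbound workqueues for brevity, and
pick_copy_cpu() is a placeholder for exactly the vendor-specific placement
policy (e.g. spreading workers across CCDs/LLC domains on AMD) that we would
like to discuss.

#include <linux/kernel.h>
#include <linux/highmem.h>
#include <linux/workqueue.h>
#include <linux/topology.h>
#include <linux/cpumask.h>
#include <linux/smp.h>
#include <linux/mm.h>

struct folio_copy_chunk {
	struct work_struct work;
	struct folio *dst;
	struct folio *src;
	unsigned int first_page;
	unsigned int nr_pages;
};

static void folio_copy_chunk_fn(struct work_struct *work)
{
	struct folio_copy_chunk *c =
		container_of(work, struct folio_copy_chunk, work);
	unsigned int i;

	/* Plain page-by-page copy; non-temporal stores (point 1.c) would go here. */
	for (i = 0; i < c->nr_pages; i++)
		copy_highpage(folio_page(c->dst, c->first_page + i),
			      folio_page(c->src, c->first_page + i));
}

/*
 * Hypothetical policy hook, the vendor-specific part. This trivial
 * fallback just walks the CPUs of one node (source node chosen
 * arbitrarily here); a CCD-aware version would round-robin over
 * LLC domains instead.
 */
static int pick_copy_cpu(int src_nid, unsigned int worker)
{
	const struct cpumask *cpus = cpumask_of_node(src_nid);
	unsigned int i = 0;
	int cpu;

	for_each_cpu(cpu, cpus)
		if (i++ == worker % cpumask_weight(cpus))
			return cpu;
	return raw_smp_processor_id();
}

/* Split one (possibly huge) folio copy across nr_workers workers. */
static void folio_copy_mt(struct folio *dst, struct folio *src,
			  unsigned int nr_workers)
{
	struct folio_copy_chunk chunks[8];
	unsigned int total = folio_nr_pages(src);
	unsigned int per_worker, w, used = 0;
	int src_nid = folio_nid(src);

	nr_workers = clamp(nr_workers, 1U, (unsigned int)ARRAY_SIZE(chunks));
	per_worker = DIV_ROUND_UP(total, nr_workers);

	for (w = 0; w < nr_workers; w++) {
		unsigned int start = w * per_worker;

		if (start >= total)
			break;
		chunks[w].dst = dst;
		chunks[w].src = src;
		chunks[w].first_page = start;
		chunks[w].nr_pages = min(per_worker, total - start);
		INIT_WORK(&chunks[w].work, folio_copy_chunk_fn);
		queue_work_on(pick_copy_cpu(src_nid, w), system_unbound_wq,
			      &chunks[w].work);
		used++;
	}
	for (w = 0; w < used; w++)
		flush_work(&chunks[w].work);
}

The interesting open questions are almost entirely in pick_copy_cpu() and in
how many workers to use for a given migration size and system load, which is
what points 1.b and 1.d below are about.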
>
>>
>>> c. Using Non-temporal instructions for CPU-based memcpy
>>> d. Upscaling/downscaling worker threads based on migration size, CPU
>>> availability (system load), bandwidth saturation, etc.
>>> 2. Interface requirements with DMA hardware:
>>> a. Standardizing APIs for DMA drivers and support for different DMA
>>> drivers
>>> b. Enhancing DMA drivers for bulk copying (e.g., SDXi Engine)
>>> 3. Resources Accounting:
>>> a. CPU cgroups accounting and fairness [3]
>>> b. Who bears migration cost? - (Migration cost attribution)
>>>
>>> References:
>>> [1] https://lore.kernel.org/all/20240614221525.19170-1-shivankg@xxxxxxx
>>> [2] https://lore.kernel.org/all/20250103172419.4148674-1-ziy@xxxxxxxxxx
>>> [3] https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@xxxxxxxxxxxxxx
>>>
>
> [1] https://lore.kernel.org/all/20250103172419.4148674-1-ziy@xxxxxxxxxx/
> --
> Best Regards,
> Yan, Zi
>
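
One more note on discussion point 2.a above: for prototyping, the generic
dmaengine DMA_MEMCPY path looks like the natural thing to build on before
any SDXi-specific (or otherwise vendor-specific) interface gets
standardized. Below is a minimal sketch of a single-page offload on top of
any DMA_MEMCPY-capable channel; the helper names are made up and most error
handling (e.g. dma_mapping_error() checks) is omitted.

#include <linux/dmaengine.h>
#include <linux/dma-mapping.h>
#include <linux/mm.h>

/* Grab any memcpy-capable channel, e.g. once at init time. */
static struct dma_chan *get_memcpy_chan(void)
{
	dma_cap_mask_t mask;

	dma_cap_zero(mask);
	dma_cap_set(DMA_MEMCPY, mask);
	return dma_request_channel(mask, NULL, NULL);
}

/* Copy one page via the DMA engine, waiting synchronously for completion. */
static int dma_copy_page_sync(struct dma_chan *chan,
			      struct page *dst, struct page *src)
{
	struct device *dev = dmaengine_get_dma_device(chan);
	struct dma_async_tx_descriptor *tx;
	dma_addr_t dst_dma, src_dma;
	dma_cookie_t cookie;
	int ret = 0;

	dst_dma = dma_map_page(dev, dst, 0, PAGE_SIZE, DMA_FROM_DEVICE);
	src_dma = dma_map_page(dev, src, 0, PAGE_SIZE, DMA_TO_DEVICE);

	tx = dmaengine_prep_dma_memcpy(chan, dst_dma, src_dma, PAGE_SIZE,
				       DMA_PREP_INTERRUPT | DMA_CTRL_ACK);
	if (!tx) {
		ret = -EIO;
		goto unmap;
	}

	cookie = dmaengine_submit(tx);
	if (dma_submit_error(cookie)) {
		ret = -EIO;
		goto unmap;
	}

	dma_async_issue_pending(chan);
	if (dma_sync_wait(chan, cookie) != DMA_COMPLETE)
		ret = -EIO;

unmap:
	dma_unmap_page(dev, src_dma, PAGE_SIZE, DMA_TO_DEVICE);
	dma_unmap_page(dev, dst_dma, PAGE_SIZE, DMA_FROM_DEVICE);
	return ret;
}

The per-page synchronous wait above is obviously the wrong thing for
throughput; batching as in [1] would amount to issuing many such descriptors
and only then waiting for completions, and point 1.a is about deciding when
that beats the multi-threaded CPU copy.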