On Thu, 23 Jan 2025, Shivank Garg wrote:

> Hi all,
>
> Zi Yan and I would like to propose the topic: Enhancements to Page
> Migration with Multi-threading and Batch Offloading to DMA.

I think this would be a very useful topic to discuss, thanks for
proposing it.

> Page migration is a critical operation in NUMA systems that can incur
> significant overheads, affecting memory management performance across
> various workloads. For example, copying folios between DRAM NUMA nodes
> can take ~25% of the total migration cost for migrating 256MB of data.
>
> Modern systems are equipped with powerful DMA engines for bulk data
> copying, GPUs, and high CPU core counts. Leveraging these hardware
> capabilities becomes essential for systems where frequent page promotion
> and demotion occur - from large-scale tiered-memory systems with CXL nodes
> to CPU-GPU coherent systems with GPU memory exposed as NUMA nodes.

Indeed, there are multiple use cases for optimizations in this area.
With the ramp of memory-tiered systems, I think there will be an even
greater reliance on memory migration going forward.

Do you have numbers to share on how offloading, even as a proof of
concept, moves the needle compared to traditional, sequential memory
migration?

> Existing page migration performs sequential page copying, underutilizing
> modern CPU architectures and high-bandwidth memory subsystems.
>
> We have proposed and posted RFCs to enhance page migration through three
> key techniques:
> 1. Batching migration operations for bulk copying data [1]
> 2. Multi-threaded folio copying [2]
> 3. DMA offloading to hardware accelerators [1]

Curious: does memory migration of pages that are actively undergoing
DMA with hardware assist fit into any of these?

> By employing batching and multi-threaded folio copying, we are able to
> achieve significant improvements in page migration throughput for large
> pages.
>
> Discussion points:
> 1. Performance:
>    a. Policy decision for DMA and CPU selection
>    b. Platform-specific scheduling of folio-copy worker threads for
>       better bandwidth utilization

Why platform-specific? I *assume* this means a generic framework that
can optimize scheduling based on the underlying hardware, not specific
implementations that can only be used on AMD, for example. Is that the
case?

>    c. Using non-temporal instructions for CPU-based memcpy
>    d. Upscaling/downscaling worker threads based on migration size,
>       CPU availability (system load), bandwidth saturation, etc.
> 2. Interface requirements with DMA hardware:
>    a. Standardizing APIs for DMA drivers and support for different
>       DMA drivers
>    b. Enhancing DMA drivers for bulk copying (e.g., the SDXi engine)
> 3. Resource accounting:
>    a. CPU cgroups accounting and fairness [3]
>    b. Who bears the migration cost? (migration cost attribution)
>
> References:
> [1] https://lore.kernel.org/all/20240614221525.19170-1-shivankg@xxxxxxx
> [2] https://lore.kernel.org/all/20250103172419.4148674-1-ziy@xxxxxxxxxx
> [3] https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@xxxxxxxxxxxxxx
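
To make a few of the points above concrete, some rough sketches follow.
First, a userspace analogue of the multi-threaded folio copy (technique
2 above): split one large copy into per-thread chunks. The thread
count, the 2MB "folio" size, and the even chunking are illustrative
choices of mine, not what the RFC actually implements:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NR_THREADS	4
#define COPY_SIZE	(2UL << 20)	/* one 2MB "folio" worth of data */

struct copy_work {
	char *dst;
	const char *src;
	size_t len;
};

static void *copy_worker(void *arg)
{
	struct copy_work *w = arg;

	memcpy(w->dst, w->src, w->len);
	return NULL;
}

int main(void)
{
	char *src = malloc(COPY_SIZE), *dst = malloc(COPY_SIZE);
	struct copy_work work[NR_THREADS];
	pthread_t tid[NR_THREADS];
	size_t chunk = COPY_SIZE / NR_THREADS;	/* divides evenly here */
	int i;

	if (!src || !dst)
		return 1;
	memset(src, 0xa5, COPY_SIZE);

	for (i = 0; i < NR_THREADS; i++) {
		work[i].dst = dst + i * chunk;
		work[i].src = src + i * chunk;
		work[i].len = chunk;
		pthread_create(&tid[i], NULL, copy_worker, &work[i]);
	}
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(tid[i], NULL);

	printf("copied %lu bytes on %d threads: %s\n", COPY_SIZE,
	       NR_THREADS, memcmp(src, dst, COPY_SIZE) ? "FAIL" : "ok");
	free(src);
	free(dst);
	return 0;
}

The interesting kernel-side questions (1b and 1d above) are exactly the
ones this toy version dodges: where those workers run relative to the
source and destination nodes, and how many of them are worth waking up
for a given batch.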
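
Second, for 1c (non-temporal instructions): the attraction is that
streaming stores bypass the cache hierarchy, so a bulk migration copy
does not evict the workload's hot lines. A minimal x86-64 sketch with
SSE2 intrinsics; the kernel would use its own copy primitives rather
than intrinsics, this is just the shape of the loop:

#include <emmintrin.h>
#include <string.h>

static void copy_nontemporal(void *dst, const void *src, size_t len)
{
	char *d = dst;
	const char *s = src;

	/* Assumes dst is 16-byte aligned, as page copies would be. */
	while (len >= 16) {
		__m128i v = _mm_loadu_si128((const __m128i *)s);

		_mm_stream_si128((__m128i *)d, v);	/* NT store */
		d += 16;
		s += 16;
		len -= 16;
	}
	if (len)
		memcpy(d, s, len);	/* sub-16-byte tail */

	_mm_sfence();	/* order NT stores before subsequent stores */
}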
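
Third, for 1a and 2a, a hypothetical shape of a batched copy request
plus a size-threshold DMA/CPU policy. None of these names exist in the
kernel or in the posted RFCs; they are strawmen for discussion:

#include <errno.h>
#include <stddef.h>
#include <string.h>

struct copy_req {
	void *dst;
	const void *src;
	size_t len;
};

/* Made-up threshold below which DMA setup cost likely wins nothing. */
#define DMA_MIN_BYTES	(1UL << 20)

/* Stub where a real driver hook (e.g., an SDXi backend) would bind. */
static int dma_submit_batch(struct copy_req *reqs, int nr)
{
	(void)reqs;
	(void)nr;
	return -ENODEV;		/* no engine bound in this sketch */
}

static void copy_batch(struct copy_req *reqs, int nr)
{
	size_t total = 0;
	int i;

	for (i = 0; i < nr; i++)
		total += reqs[i].len;

	/* Large batches go to the engine; otherwise fall back to CPU. */
	if (total >= DMA_MIN_BYTES && dma_submit_batch(reqs, nr) == 0)
		return;

	/* CPU path; a real version would fan out to worker threads. */
	for (i = 0; i < nr; i++)
		memcpy(reqs[i].dst, reqs[i].src, reqs[i].len);
}

The threshold is the interesting tunable: below some batch size the
engine's setup and completion cost presumably eats any bandwidth win,
which feeds straight into the policy question in 1a. Standardizing the
request format (2a) would let different engines (SDXi or otherwise)
plug into the same migration path.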