This series introduces enhancements to the page migration code to optimize the
"folio move" operations by batching them and to enable offloading to DMA
hardware accelerators.

Page migration involves three key steps:

1. Unmap: Allocate dst folios and replace the src folio PTEs with migration
   PTEs.
2. TLB Flush: Flush the TLB for all unmapped folios.
3. Move: Copy the page mappings, flags and contents from src to dst; update
   metadata, lists and refcounts, and restore working PTEs.

While the first two steps (setting TLB flush pending for unmapped folios and
the batched TLB flush) have already been optimized with batching, this series
focuses on optimizing the folio move step.

In the current design, the folio move operation is performed sequentially for
each folio:

    for_each_folio() {
        Copy folio metadata like flags and mappings
        Copy the folio content from src to dst
        Update PTEs with new mappings
    }

In the proposed design, we batch the folio copy operations to leverage DMA
offloading. The updated design is as follows:

    for_each_folio() {
        Copy folio metadata like flags and mappings
    }

    Batch copy the folio content from src to dst by offloading to the DMA engine

    for_each_folio() {
        Update PTEs with new mappings
    }
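When no DMA offload is available, the batched copy step falls back to the CPU.
A minimal sketch of what such a batched CPU copy can look like is shown below;
folios_copy_cpu_sketch() is a hypothetical name used only for illustration, it
assumes the src and dst lists are non-empty and paired 1:1, and the actual
folios_copy() helper added by this series may differ in signature and details:

    #include <linux/list.h>
    #include <linux/mm.h>       /* folio_copy(), struct folio */
    #include <linux/sched.h>    /* cond_resched() */

    /*
     * Illustrative sketch only: copy folio contents for a whole batch on
     * the CPU, walking the src and dst lists in lockstep the way
     * migrate_pages_batch() pairs src and dst folios.
     */
    static void folios_copy_cpu_sketch(struct list_head *dst_list,
                                       struct list_head *src_list)
    {
        struct folio *src, *dst;

        dst = list_first_entry(dst_list, struct folio, lru);
        list_for_each_entry(src, src_list, lru) {
            folio_copy(dst, src);       /* existing per-folio copy helper */
            dst = list_next_entry(dst, lru);
            cond_resched();
        }
    }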
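The offload path builds on the generic DMAengine client API (the dcbm test
driver added later in the series is one such user). Conceptually, pushing a
single src/dst folio pair through a memcpy-capable channel looks roughly like
the sketch below; dma_copy_folio_sketch() is a hypothetical helper, and the
actual driver batches descriptors for the whole folio list and differs in
structure and error handling:

    #include <linux/dmaengine.h>
    #include <linux/dma-mapping.h>
    #include <linux/mm.h>

    /*
     * Illustrative sketch only: synchronously copy one folio through a
     * DMA_MEMCPY-capable channel using the generic DMAengine client API.
     * Error-path unmapping is omitted for brevity.
     */
    static int dma_copy_folio_sketch(struct dma_chan *chan,
                                     struct folio *dst, struct folio *src)
    {
        struct device *dev = chan->device->dev;
        struct dma_async_tx_descriptor *tx;
        dma_addr_t saddr, daddr;
        dma_cookie_t cookie;
        enum dma_status status;
        size_t len = folio_size(src);

        saddr = dma_map_page(dev, folio_page(src, 0), 0, len, DMA_TO_DEVICE);
        daddr = dma_map_page(dev, folio_page(dst, 0), 0, len, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, saddr) || dma_mapping_error(dev, daddr))
            return -EIO;

        tx = dmaengine_prep_dma_memcpy(chan, daddr, saddr, len,
                                       DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
        if (!tx)
            return -EIO;

        cookie = dmaengine_submit(tx);
        if (dma_submit_error(cookie))
            return -EIO;

        dma_async_issue_pending(chan);

        /* Poll for completion; a real driver would use completion callbacks. */
        status = dma_sync_wait(chan, cookie);

        dma_unmap_page(dev, daddr, len, DMA_FROM_DEVICE);
        dma_unmap_page(dev, saddr, len, DMA_TO_DEVICE);

        return status == DMA_COMPLETE ? 0 : -EIO;
    }

A memcpy-capable channel would typically be requested once up front (e.g.
dma_cap_set(DMA_MEMCPY, mask) followed by dma_request_chan_by_mask()) and
released with dma_release_channel() when migration offload is torn down.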
Motivation:

Copying data across NUMA nodes during page migration incurs significant
overhead. For instance, the folio copy can take up to 26.6% of the total
migration cost when migrating 256MB of data. Modern systems are equipped with
powerful DMA engines for bulk data copying. Utilizing these hardware
accelerators will become essential for large-scale tiered-memory systems with
CXL nodes, where a lot of page promotion and demotion can happen. Following
the trend of batching operations in the memory migration core path (such as
batch migration and batch TLB flush), batching the folio data copy is a
logical next step.

We conducted experiments to measure the folio copy overhead for page migration
from a remote node to a local NUMA node, modeling page promotions for
different workload sizes (4KB, 2MB, 256MB and 1GB).

Setup information:
- AMD Zen 3 EPYC server (2 sockets, 32 cores, SMT enabled), one NUMA node per
  socket
- Linux kernel 6.8.0, DVFS set to performance, cpuinfo_cur_freq: 2 GHz
- THP, compaction and numa_balancing disabled to reduce interference

    migrate_pages() {           <- t1
        ..
                                <- t2
        folio_copy()
                                <- t3
        ..
    }                           <- t4

    Overhead fraction, F = (t3 - t2) / (t4 - t1)

Measurement: migrate_pages and folio_copy are reported as mean ± SD in CPU
cycles per page; F is the overhead fraction defined above.

Generic kernel:

    Size    migrate_pages         folio_copy           F
    4KB     17799.00 ± 4278.25    794 ± 232.87         0.0478 ± 0.0199
    2MB      3478.42 ±   94.93    493.84 ±  28.21      0.1418 ± 0.0050
    256MB    3668.56 ±  158.47    815.40 ± 171.76      0.2206 ± 0.0371
    1GB      3769.98 ±   55.79    804.68 ±  60.07      0.2132 ± 0.0134

Results with the patched kernel:

1. Offload disabled - folios batch-moved using the CPU:

    Size    migrate_pages         folio_copy           F
    4KB     14941.60 ± 2556.53    799.60 ± 211.66      0.0554 ± 0.0190
    2MB      3448.44 ±   83.74    533.34 ±  37.81      0.1545 ± 0.0085
    256MB    3723.56 ±  132.93    907.64 ± 132.63      0.2427 ± 0.0270
    1GB      3788.20 ±   46.65    888.46 ±  49.50      0.2344 ± 0.0107

2. Offload enabled - folios batch-moved using the DMA engine:

    Size    migrate_pages         folio_copy           F
    4KB     46739.80 ± 4827.15    32222.40 ± 3543.42   0.6904 ± 0.0423
    2MB     13798.10 ±  205.33    10971.60 ±  202.50   0.7951 ± 0.0033
    256MB   13217.20 ±  163.99    10431.20 ±  167.25   0.7891 ± 0.0029
    1GB     13309.70 ±  113.93    10410.00 ±  117.77   0.7821 ± 0.0023

Discussion:

The DMA engine achieved a net throughput of 768 MB/s. Additional optimizations
are needed to make DMA offloading beneficial compared to CPU-based migration;
these could include parallelism, specialized DMA hardware, and asynchronous or
speculative data migration.

Status:

The current patchset is functional, except for non-LRU folios.

Dependencies:

1. This series is based on Linux v6.8.
2. Patches 1-3 contain the preparatory work and the implementation of batched
   folio moves. Patch 4 adds support for DMA offload.
3. DMA hardware and driver support are required to enable DMA offload. Without
   suitable support, the CPU is used for batch migration. The requirements are
   described in Patch 4.
4. Patch 5 adds a DMA driver using the DMAengine APIs for end-to-end testing
   and validation.

Testing:

The patch series has been tested with migrate_pages(2) and move_pages(2) using
anonymous memory and memory-mapped files.

Byungchul Park (1):
  mm: separate move/undo doing on folio list from migrate_pages_batch()

Mike Day (1):
  mm: add support for DMA folio Migration

Shivank Garg (3):
  mm: add folios_copy() for copying pages in batch during migration
  mm: add migrate_folios_batch_move to batch the folio move operations
  dcbm: add dma core batch migrator for batch page offloading

 drivers/dma/Kconfig         |   2 +
 drivers/dma/Makefile        |   1 +
 drivers/dma/dcbm/Kconfig    |   7 +
 drivers/dma/dcbm/Makefile   |   1 +
 drivers/dma/dcbm/dcbm.c     | 229 +++++++++++++++++++++
 include/linux/migrate_dma.h |  36 ++++
 include/linux/mm.h          |   1 +
 mm/Kconfig                  |   8 +
 mm/Makefile                 |   1 +
 mm/migrate.c                | 385 +++++++++++++++++++++++++++++++-----
 mm/migrate_dma.c            |  51 +++++
 mm/util.c                   |  22 +++
 12 files changed, 692 insertions(+), 52 deletions(-)
 create mode 100644 drivers/dma/dcbm/Kconfig
 create mode 100644 drivers/dma/dcbm/Makefile
 create mode 100644 drivers/dma/dcbm/dcbm.c
 create mode 100644 include/linux/migrate_dma.h
 create mode 100644 mm/migrate_dma.c

-- 
2.34.1