On Sat, Jun 15, 2024 at 03:45:20AM +0530, Shivank Garg wrote:
> We conducted experiments to measure folio copy overheads for page
> migration from a remote node to a local NUMA node, modeling page
> promotions for different workload sizes (4KB, 2MB, 256MB and 1GB).
>
> Setup Information: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT
> Enabled), 1 NUMA node connected to each socket.
> Linux Kernel 6.8.0, DVFS set to Performance, and cpuinfo_cur_freq: 2 GHz.
> THP, compaction, numa_balancing are disabled to reduce interference.
>
> migrate_pages() { <- t1
>     ..
>     <- t2
>     folio_copy()
>     <- t3
>     ..
> } <- t4
>
> overheads Fraction, F= (t3-t2)/(t4-t1)
> Measurement: Mean ± SD is measured in cpu_cycles/page
> Generic Kernel
> 4KB:: migrate_pages:17799.00±4278.25 folio_copy:794±232.87 F:0.0478±0.0199
> 2MB:: migrate_pages:3478.42±94.93 folio_copy:493.84±28.21 F:0.1418±0.0050
> 256MB:: migrate_pages:3668.56±158.47 folio_copy:815.40±171.76 F:0.2206±0.0371
> 1GB:: migrate_pages:3769.98±55.79 folio_copy:804.68±60.07 F:0.2132±0.0134
>
> Results with patched kernel:
> 1. Offload disabled - folios batch-move using CPU
> 4KB:: migrate_pages:14941.60±2556.53 folio_copy:799.60±211.66 F:0.0554±0.0190
> 2MB:: migrate_pages:3448.44±83.74 folio_copy:533.34±37.81 F:0.1545±0.0085
> 256MB:: migrate_pages:3723.56±132.93 folio_copy:907.64±132.63 F:0.2427±0.0270
> 1GB:: migrate_pages:3788.20±46.65 folio_copy:888.46±49.50 F:0.2344±0.0107
>
> 2. Offload enabled - folios batch-move using DMAengine
> 4KB:: migrate_pages:46739.80±4827.15 folio_copy:32222.40±3543.42 F:0.6904±0.0423
> 2MB:: migrate_pages:13798.10±205.33 folio_copy:10971.60±202.50 F:0.7951±0.0033
> 256MB:: migrate_pages:13217.20±163.99 folio_copy:10431.20±167.25 F:0.7891±0.0029
> 1GB:: migrate_pages:13309.70±113.93 folio_copy:10410.00±117.77 F:0.7821±0.0023

You haven't measured the important thing though -- what's the cost _to
userspace_?  When the CPU does the copy, the data is now cache-hot in
that CPU's cache.  When the DMA engine does the copy, it's not cache-hot
in any CPU.

Now, this may not be a big problem.  I don't think we do anything to
ensure that the CPU that is going to access the folio in userspace is
the one which does the copy.

But your methodology is wrong.
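
For illustration, a minimal userspace sketch of the kind of measurement meant here: time the first pass over a buffer right after it has been migrated, once with the CPU copy and once with the DMA copy, and compare.  The function names touch_lines() and time_touch_ns() are hypothetical, the 64-byte stride assumes a 64-byte cache line, and triggering the migration itself (for example with move_pages(2)) is left to the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/*
 * Read one byte every `stride` bytes so that each cache line (stride = 64,
 * an assumption) or each page (stride = page size) is loaded by this CPU.
 * The sum is returned so the compiler cannot discard the loads.
 */
uint64_t touch_lines(const volatile unsigned char *buf, size_t len,
                     size_t stride)
{
    uint64_t sum = 0;

    assert(stride > 0);         /* a zero stride would never terminate */
    for (size_t off = 0; off < len; off += stride)
        sum += buf[off];
    return sum;
}

/*
 * Time a single touch_lines() pass in nanoseconds using CLOCK_MONOTONIC.
 * Call this once, immediately after the migration: a second pass would
 * already be cache-hot and hide the effect being measured.
 */
uint64_t time_touch_ns(const unsigned char *buf, size_t len, size_t stride,
                       uint64_t *sum_out)
{
    struct timespec start, end;
    uint64_t sum;
    int64_t ns;

    clock_gettime(CLOCK_MONOTONIC, &start);
    sum = touch_lines(buf, len, stride);
    clock_gettime(CLOCK_MONOTONIC, &end);

    if (sum_out)
        *sum_out = sum;

    ns = (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
         + (end.tv_nsec - start.tv_nsec);
    return (uint64_t)ns;
}
```

Adding that first-touch time to the migrate_pages() cost per page would give a number that reflects what the application actually sees, rather than only the time spent inside the kernel.  Whether the thread doing the touching runs on the CPU that did the copy is not controlled by this sketch; pinning it is a separate choice.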