Introduction:
=============
This patchset is an outcome of an ongoing collaboration between AMD and
Meta. Meta wanted to explore an alternative page promotion technique, as
they observe high latency spikes in their workloads that access CXL memory.

In the current hot page promotion, all the activities, including process
address space scanning, NUMA hint fault handling and page migration, are
performed in the process context, i.e., the scanning overhead is borne by
the application.

This is an early RFC patch series to do (slow tier) CXL page promotion.
The approach in this patchset addresses the issue by adding PTE Accessed
(A) bit scanning. Scanning is done by a global kernel thread which
routinely scans all the processes' address spaces and checks for accesses
by reading the PTE A bit. It then migrates/promotes the pages to the
toptier node (node 0 in the current approach).

Thus, the approach pushes the overhead of scanning, NUMA hint faults and
migration off from the process context.

Initial results show promising numbers on a microbenchmark.

Experiment:
============
Abench microbenchmark:
 - Allocates 8GB/32GB of memory on the CXL node.
 - 64 threads are created, and each thread randomly accesses pages at 4K
   granularity.
 - 512 iterations with a delay of 1 us between two successive iterations.

SUT: 512 CPU, 2 node, 256GB, AMD EPYC.

3 runs, command: abench -m 2 -d 1 -i 512 -s <size>

The benchmark reports how much time is taken to complete the task; lower
is better. The expectation is that CXL node memory is migrated as fast as
possible.

Base case:    6.11-rc6 with numab mode = 2 (hot page promotion is enabled).
Patched case: 6.11-rc6 with numab mode = 0 (NUMA balancing is disabled);
              the daemon is expected to do the page promotion.

Result [*]:
========
              base                    patched
        time in sec (%stdev)    time in sec (%stdev)    %gain
 8GB      133.66 ( 0.38 )         113.77 ( 1.83 )       14.88
32GB      584.77 ( 0.19 )         542.79 ( 0.11 )        7.17

[*] Please note the current patchset applies on 6.13-rc, but these results
are old because the latest kernel has issues in populating CXL node memory.
Emailing findings/fix on that soon.

Overhead:
The time below is calculated using patch 10. The actual overhead for the
patched case may be even lower.

                  (scan + migration) time in sec
Total memory    base kernel    patched kernel    %gain
 8GB               65.743           13.93        78.81
32GB              153.95           132.12        14.18

Breakup for 8GB:
                          base    patched
numa_task_work_oh         0.883     0
numa_hf_migration_oh     64.86      0
kmmscand_scan_oh          0         2.74
kmmscand_migration_oh     0        11.19

Breakup for 32GB:
                          base    patched
numa_task_work_oh         4.79      0
numa_hf_migration_oh    149.16      0
kmmscand_scan_oh          0        23.4
kmmscand_migration_oh     0       108.72

Limitations:
===========
The PTE A bit scanning approach lacks information about the exact
destination node to migrate to.
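For reference, the core of the idea can be sketched roughly as below. This
is a simplified illustration only, not the code in mm/kmmscand.c of this
series: kmmscand_pte_entry(), kmmscand_scan_mm() and kmmscand_migrate_list()
are made-up names here, error/THP handling is omitted, and, because of the
limitation above, the destination is simply hard-coded to toptier node 0.

    /*
     * Simplified sketch of PTE A bit based scanning (illustrative only,
     * not the actual mm/kmmscand.c code). Assumes it lives in mm/ so that
     * helpers such as folio_isolate_lru() are available.
     */
    #include <linux/mm.h>
    #include <linux/pagewalk.h>
    #include <linux/swap.h>

    /* Hypothetical helper, assumed to wrap migrate_pages() for 'nid'. */
    void kmmscand_migrate_list(struct list_head *list, int nid);

    static int kmmscand_pte_entry(pte_t *pte, unsigned long addr,
                                  unsigned long next, struct mm_walk *walk)
    {
            struct list_head *migrate_list = walk->private;
            pte_t pteval = ptep_get(pte);
            struct folio *folio;

            if (!pte_present(pteval))
                    return 0;

            /* Test and clear the A bit so the next scan sees new accesses. */
            if (!ptep_test_and_clear_young(walk->vma, addr, pte))
                    return 0;

            folio = vm_normal_folio(walk->vma, addr, pteval);
            if (!folio || !folio_test_lru(folio))
                    return 0;

            /* Collect accessed slow-tier (non-toptier) folios for promotion. */
            if (folio_nid(folio) != 0 && folio_isolate_lru(folio))
                    list_add_tail(&folio->lru, migrate_list);

            return 0;
    }

    static const struct mm_walk_ops kmmscand_walk_ops = {
            .pte_entry      = kmmscand_pte_entry,
            .walk_lock      = PGWALK_RDLOCK,
    };

    /* Called periodically by the kmmscand kthread for every tracked mm. */
    static void kmmscand_scan_mm(struct mm_struct *mm)
    {
            LIST_HEAD(migrate_list);

            if (mmap_read_lock_killable(mm))
                    return;
            walk_page_range(mm, 0, TASK_SIZE, &kmmscand_walk_ops,
                            &migrate_list);
            mmap_read_unlock(mm);

            /* Promote everything found accessed to the toptier node. */
            kmmscand_migrate_list(&migrate_list, 0);
    }

The actual series additionally throttles this work using scan_period and
scan_size (patches 5 and 6) and adds sysfs knobs, vmstat counters and
tracepoints on top of this basic loop.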
Notes/Observations on design/implementation/alternatives/TODOs:
================================
1. Fine-tune scan throttling.

2. Use migrate_balanced_pgdat() to balance the toptier node before
   migration, OR use migrate_misplaced_folio_prepare() directly. But it
   may need some optimizations (e.g., invoke it only occasionally so that
   the overhead is not incurred for every migration).

3. Explore whether a separate PAGE_EXT flag is needed instead of reusing
   the PAGE_IDLE flag (cons: complicates PTE A bit handling in the
   system). But practically it does not look like a good idea.

4. Use timestamp-information-based migration (similar to numab mode = 2)
   instead of migrating immediately when the PTE A bit is set.
   (cons:
    - It will not be accurate, since it is done outside of the process
      context.
    - The performance benefit may be lost.)

5. Explore whether we need to use PFN information + a hash list instead of
   a simple migration list. Here scanning is done directly on PFNs
   belonging to the CXL node.

6. Hold the PTE lock before migration.

7. Solve: how to find the target toptier node for migration.

8. Use DAMON APIs, or reuse the part of DAMON which already tracks ranges
   of physical addresses accessed.

9. Gregory has nicely mentioned some details/ideas on different approaches
   in [1] (development notes), in the context of promoting unmapped page
   cache folios.

10. SJ had pointed out concerns about kernel-thread based approaches, as
    in kstaled [2]. So the current patchset has tried to address the issue
    with simple algorithms to reduce CPU overhead. Migration throttling,
    running the daemon at NICE priority, and parallelizing migration with
    scanning could help further.

11. Scanned toptier pages can be used to assist the current NUMAB by
    providing information on hot VMAs.

Credits
=======
Thanks to Bharata, Joannes, Gregory, SJ and Chris for their valuable
comments and support. The kernel thread skeleton and some parts of the
code are heavily inspired by the khugepaged implementation and by parts
of the IBS patches from Bharata [3].

Looking forward to your comments on whether the current approach in this
*early* RFC looks promising, or whether there are alternative ideas, etc.

Links:
[1] https://lore.kernel.org/lkml/20241127082201.1276-1-gourry@xxxxxxxxxx/
[2] kstaled: https://lore.kernel.org/lkml/1317170947-17074-3-git-send-email-walken@xxxxxxxxxx/#r
[3] https://lore.kernel.org/lkml/Y+Pj+9bbBbHpf6xM@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

I might have unintentionally CCed more or fewer people than needed.

Raghavendra K T (10):
  mm: Add kmmscand kernel daemon
  mm: Maintain mm_struct list in the system
  mm: Scan the mm and create a migration list
  mm/migration: Migrate accessed folios to toptier node
  mm: Add throttling of mm scanning using scan_period
  mm: Add throttling of mm scanning using scan_size
  sysfs: Add sysfs support to tune scanning
  vmstat: Add vmstat counters
  trace/kmmscand: Add tracing of scanning and migration
  kmmscand: Add scanning

 fs/exec.c                     |    4 +
 include/linux/kmmscand.h      |   30 +
 include/linux/mm.h            |   14 +
 include/linux/mm_types.h      |    4 +
 include/linux/vm_event_item.h |   14 +
 include/trace/events/kmem.h   |   99 +++
 kernel/fork.c                 |    4 +
 kernel/sched/fair.c           |   13 +-
 mm/Kconfig                    |    7 +
 mm/Makefile                   |    1 +
 mm/huge_memory.c              |    1 +
 mm/kmmscand.c                 | 1144 +++++++++++++++++++++++++++++++++
 mm/memory.c                   |   12 +-
 mm/vmstat.c                   |   14 +
 14 files changed, 1352 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/kmmscand.h
 create mode 100644 mm/kmmscand.c

base-commit: bcc8eda6d34934d80b96adb8dc4ff5dfc632a53a
-- 
2.39.3