Introduction:
=============
This patchset is an outcome of an ongoing collaboration between AMD and
Meta. Meta wanted to explore an alternative page promotion technique, as
they observe high latency spikes in their workloads that access CXL memory.

In the current hot page promotion, all the activities, including process
address space scanning, NUMA hint fault handling and page migration, are
performed in the process context, i.e., the scanning overhead is borne by
the application.

This is an early RFC patch series to do (slow tier) CXL page promotion.
The approach in this patchset addresses the issue by adding PTE Accessed
(A) bit scanning. Scanning is done by a global kernel thread which
routinely scans all the processes' address spaces and checks for accesses
by reading the PTE A bit. It then migrates/promotes the pages to the
toptier node (node 0 in the current approach).

Thus, the approach pushes the overhead of scanning, NUMA hint faults and
migration off from the process context.

Initial results show promising numbers on a microbenchmark.

Experiment:
============
Abench microbenchmark:
 - Allocates 8GB/32GB of memory on the CXL node.
 - 64 threads are created, and each thread randomly accesses pages at 4K
   granularity.
 - 512 iterations with a delay of 1 us between two successive iterations.

SUT: 512 CPU, 2 node, 256GB, AMD EPYC.

3 runs, command: abench -m 2 -d 1 -i 512 -s <size>

The benchmark reports how much time is taken to complete the task; lower
is better. The expectation is that CXL node memory is migrated as fast as
possible.

Base case:    6.11-rc6 with numab mode = 2 (hot page promotion is enabled).
Patched case: 6.11-rc6 with numab mode = 0 (NUMA balancing is disabled);
              the daemon is expected to do the page promotion.

Result [*]:
========
              base                    patched
        time in sec (%stdev)    time in sec (%stdev)    %gain
 8GB      133.66 ( 0.38 )         113.77 ( 1.83 )       14.88
32GB      584.77 ( 0.19 )         542.79 ( 0.11 )        7.17

[*] Please note the current patchset applies on 6.13-rc, but these results
are old because the latest kernel has issues in populating CXL node memory.
Emailing findings/fix on that soon.

Overhead:
The time below is calculated using patch 10. The actual overhead for the
patched case may be even lower.

                  (scan + migration) time in sec
Total memory    base kernel    patched kernel    %gain
 8GB               65.743           13.93        78.81
32GB              153.95           132.12        14.18

Breakup for 8GB:
                          base    patched
numa_task_work_oh         0.883     0
numa_hf_migration_oh     64.86      0
kmmscand_scan_oh          0         2.74
kmmscand_migration_oh     0        11.19

Breakup for 32GB:
                          base    patched
numa_task_work_oh         4.79      0
numa_hf_migration_oh    149.16      0
kmmscand_scan_oh          0        23.4
kmmscand_migration_oh     0       108.72

Limitations:
===========
The PTE A bit scanning approach lacks information about the exact
destination node to migrate to.
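For reference, the core of the idea can be sketched roughly as below. This
is a simplified illustration only, not the code in mm/kmmscand.c of this
series: kmmscand_pte_entry(), kmmscand_scan_mm() and kmmscand_migrate_list()
are made-up names here, error/THP handling is omitted, and, because of the
limitation above, the destination is simply hard-coded to toptier node 0.

    /*
     * Simplified sketch of PTE A bit based scanning (illustrative only,
     * not the actual mm/kmmscand.c code). Assumes it lives in mm/ so that
     * helpers such as folio_isolate_lru() are available.
     */
    #include <linux/mm.h>
    #include <linux/pagewalk.h>
    #include <linux/swap.h>

    /* Hypothetical helper, assumed to wrap migrate_pages() for 'nid'. */
    void kmmscand_migrate_list(struct list_head *list, int nid);

    static int kmmscand_pte_entry(pte_t *pte, unsigned long addr,
                                  unsigned long next, struct mm_walk *walk)
    {
            struct list_head *migrate_list = walk->private;
            pte_t pteval = ptep_get(pte);
            struct folio *folio;

            if (!pte_present(pteval))
                    return 0;

            /* Test and clear the A bit so the next scan sees new accesses. */
            if (!ptep_test_and_clear_young(walk->vma, addr, pte))
                    return 0;

            folio = vm_normal_folio(walk->vma, addr, pteval);
            if (!folio || !folio_test_lru(folio))
                    return 0;

            /* Collect accessed slow-tier (non-toptier) folios for promotion. */
            if (folio_nid(folio) != 0 && folio_isolate_lru(folio))
                    list_add_tail(&folio->lru, migrate_list);

            return 0;
    }

    static const struct mm_walk_ops kmmscand_walk_ops = {
            .pte_entry      = kmmscand_pte_entry,
            .walk_lock      = PGWALK_RDLOCK,
    };

    /* Called periodically by the kmmscand kthread for every tracked mm. */
    static void kmmscand_scan_mm(struct mm_struct *mm)
    {
            LIST_HEAD(migrate_list);

            if (mmap_read_lock_killable(mm))
                    return;
            walk_page_range(mm, 0, TASK_SIZE, &kmmscand_walk_ops,
                            &migrate_list);
            mmap_read_unlock(mm);

            /* Promote everything found accessed to the toptier node. */
            kmmscand_migrate_list(&migrate_list, 0);
    }

The actual series additionally throttles this work using scan_period and
scan_size (patches 5 and 6) and adds sysfs knobs, vmstat counters and
tracepoints on top of this basic loop.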
Notes/Observations on design/implementation/alternatives/TODOs:
================================
1. Fine-tune scan throttling.

2. Use migrate_balanced_pgdat() to balance the toptier node before
   migration, OR use migrate_misplaced_folio_prepare() directly. But it
   may need some optimizations (e.g., invoke it only occasionally so that
   the overhead is not incurred for every migration).

3. Explore whether a separate PAGE_EXT flag is needed instead of reusing
   the PAGE_IDLE flag (cons: complicates PTE A bit handling in the
   system). But practically it does not look like a good idea.

4. Use timestamp-information-based migration (similar to numab mode = 2)
   instead of migrating immediately when the PTE A bit is set.
   (cons:
    - It will not be accurate, since it is done outside of the process
      context.
    - The performance benefit may be lost.)

5. Explore whether we need to use PFN information + a hash list instead of
   a simple migration list. Here scanning is done directly on PFNs
   belonging to the CXL node.

6. Hold the PTE lock before migration.

7. Solve: how to find the target toptier node for migration.

8. Use DAMON APIs, or reuse the part of DAMON which already tracks ranges
   of physical addresses accessed.

9. Gregory has nicely mentioned some details/ideas on different approaches
   in [1] (development notes), in the context of promoting unmapped page
   cache folios.

10. SJ had pointed out concerns about kernel-thread based approaches, as
    in kstaled [2]. So the current patchset has tried to address the issue
    with simple algorithms to reduce CPU overhead. Migration throttling,
    running the daemon at NICE priority, and parallelizing migration with
    scanning could help further.

11. Scanned toptier pages can be used to assist the current NUMAB by
    providing information on hot VMAs.

Credits
=======
Thanks to Bharata, Joannes, Gregory, SJ and Chris for their valuable
comments and support. The kernel thread skeleton and some parts of the
code are heavily inspired by the khugepaged implementation and by parts
of the IBS patches from Bharata [3].

Looking forward to your comments on whether the current approach in this
*early* RFC looks promising, or whether there are alternative ideas, etc.

Links:
[1] https://lore.kernel.org/lkml/20241127082201.1276-1-gourry@xxxxxxxxxx/
[2] kstaled: https://lore.kernel.org/lkml/1317170947-17074-3-git-send-email-walken@xxxxxxxxxx/#r
[3] https://lore.kernel.org/lkml/Y+Pj+9bbBbHpf6xM@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

I might have unintentionally CCed more or fewer people than needed.

Raghavendra K T (10):
  mm: Add kmmscand kernel daemon
  mm: Maintain mm_struct list in the system
  mm: Scan the mm and create a migration list
  mm/migration: Migrate accessed folios to toptier node
  mm: Add throttling of mm scanning using scan_period
  mm: Add throttling of mm scanning using scan_size
  sysfs: Add sysfs support to tune scanning
  vmstat: Add vmstat counters
  trace/kmmscand: Add tracing of scanning and migration
  kmmscand: Add scanning

 fs/exec.c                     |    4 +
 include/linux/kmmscand.h      |   30 +
 include/linux/mm.h            |   14 +
 include/linux/mm_types.h      |    4 +
 include/linux/vm_event_item.h |   14 +
 include/trace/events/kmem.h   |   99 +++
 kernel/fork.c                 |    4 +
 kernel/sched/fair.c           |   13 +-
 mm/Kconfig                    |    7 +
 mm/Makefile                   |    1 +
 mm/huge_memory.c              |    1 +
 mm/kmmscand.c                 | 1144 +++++++++++++++++++++++++++++++++
 mm/memory.c                   |   12 +-
 mm/vmstat.c                   |   14 +
 14 files changed, 1352 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/kmmscand.h
 create mode 100644 mm/kmmscand.c

base-commit: bcc8eda6d34934d80b96adb8dc4ff5dfc632a53a
-- 
2.39.3