I am posting a summary of the NUMA balancing improvements tried out.
(The intention is an RFC, revisiting these in the future when someone sees
potential benefits with PATCH1 and PATCH2.) PATCH3 has more potential for
workloads that need aggressive scanning, but may need migration
rate-limiting.

Patchset details:
=================

PATCH 1. Increase the number of access PID (information of tasks accessing
a VMA) history windows from 2 to 4.

Based on PeterZ's suggestion/patch.

Rationale:
- Increases the depth of the historical access information of tasks.
- Gives a better view of hot VMAs.
- Gives a better view of VMAs that are widely shared amongst tasks.

With that we can make better decisions when choosing the VMAs that need to
be scanned for introducing PROT_NONE.

PATCH 2. Increase the number of bits used to map tasks accessing a VMA
from 64 to 128.

Based on a suggestion by Ingo.

Rationale: Decreases the number of collisions (false positives) while the
whole information still fits in a cacheline. This is potentially helpful
when a workload involves more threads, which otherwise
- unnecessarily do VMA scans, and
- create contention in the scan path.

PATCH 3. Change the notion of the 256MB-per-scan limit to a 64k PTE scan
(for 4k pages). Extend the same logic to hugepages / THP.

Based on a suggestion by Mel.

Rationale: This helps to cover more memory, especially when THP or
hugepages are involved.

PS: Please note all 3 are independent patches. Apologies in advance if the
patchset confuses any patching script. More comments/details will be added
for patches of interest.

Summary of results:
===================

PATCH1 and PATCH2 give a benefit in some of the cases I ran, but they
still need a more convincing usecase / results (as of the 6.9+ kernel).

PATCH3: Some benchmarks, such as XSBench and Hashjoin, benefit from more
scanning. But microbenchmarks (such as allocate on one node, then fault
from the other node to see how fast migration happens) suffer because of
the aggressive migration overhead.
Overall, this could be addressed if we combine rate-limiting of migration
(similar to CXL) or tune down the scan rate when it is not necessary to
scan (for example, I still see that VMA scanning does not slow down even
when the rate of migration has slowed or all migrations have completed).

Change stat for each of the patches
===================================

PATCH 1:

Raghavendra K T (1):
  sched/numa: Hot VMA and shared VMA optimization

 include/linux/mm.h       | 12 ++++++---
 include/linux/mm_types.h | 11 +++++---
 kernel/sched/fair.c      | 58 ++++++++++++++++++++++++++++++++++++----
 3 files changed, 69 insertions(+), 12 deletions(-)

base-commit: b0546776ad3f332e215cebc0b063ba4351971cca

===================================

PATCH 2:

Raghavendra K T (1):
  sched/numa: Increase the VMA accessing PID bits

 include/linux/mm.h       | 29 ++++++++++++++++++++++++++---
 include/linux/mm_types.h |  7 ++++++-
 kernel/sched/fair.c      | 21 ++++++++++++++++-----
 3 files changed, 48 insertions(+), 9 deletions(-)

base-commit: b0546776ad3f332e215cebc0b063ba4351971cca

===================================

PATCH 3:

Raghavendra K T (1):
  sched/numa: Convert 256MB VMA scan limit notion

 include/linux/hugetlb.h |  3 +-
 include/linux/mm.h      | 16 +++++++-
 kernel/sched/fair.c     | 15 ++---
 mm/hugetlb.c            |  9 +++++
 mm/mempolicy.c          | 11 +++++-
 mm/mprotect.c           | 87 +++++++++++++++++++++++++++++++++--------
 6 files changed, 115 insertions(+), 26 deletions(-)

base-commit: b0546776ad3f332e215cebc0b063ba4351971cca

--
2.34.1