In order to check how efficiently the existing NUMA balancing based hot page
promotion mechanism can detect hot regions and promote pages for workloads
with large memory footprints, I wrote and tested a program that allocates a
huge amount of memory but routinely touches only small parts of it.

This microbenchmark provisions memory on both the DRAM node and the CXL node.
It then divides the entire allocated memory into smaller chunks and randomly
chooses a chunk to generate memory accesses from. Each chunk is accessed for
a fixed number of iterations to create the notion of hotness. Within each
chunk, the individual pages at 4K granularity are again accessed in random
order (a rough sketch of this access pattern is included further below). When
a chunk is taken up for access in this manner, its pages can be residing
either on DRAM or on CXL. In the latter case, the NUMA balancing driven hot
page promotion logic is expected to detect and promote the hot pages that
reside on CXL.

The experiment was conducted on a 2P AMD Bergamo system that has CXL as the
3rd node.

$ numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0-127,256-383
node 0 size: 128054 MB
node 1 cpus: 128-255,384-511
node 1 size: 128880 MB
node 2 cpus:
node 2 size: 129024 MB
node distances:
node   0   1   2
  0:  10  32  60
  1:  32  10  50
  2: 255 255  10

It is seen that the number of pages that get promoted is really low, and the
reason for this is that the NUMA hint fault latency turns out to be much
higher than the hot threshold most of the time. Here are a few latency and
threshold sample values captured from the should_numa_migrate_memory()
routine when the benchmark was run:

latency (ms)    threshold (ms)
       20620              1125
       56185              1125
       98710              1250
      148871              1375
      182891              1625
      369415              1875
      630745              2000

The NUMA hint fault latency metric, which is based on the absolute time
difference between the scan time and the fault time, may not be suitable for
applications that have large amounts of memory. If the time difference
between the scan time PTE update and the subsequent access (hint fault) is
large, the existing logic in should_numa_migrate_memory() that determines
whether a page needs to be migrated will reject more pages than it selects
for promotion.

To address this problem, this RFC converts the absolute time based hint fault
latency into a relative metric: the number of hint faults that have occurred
between the scan time and the page's fault time is used as the latency
measure and compared against the threshold. A sketch of the idea is shown
below.

This is quite experimental work and there are still things to take care of.
While more testing needs to be conducted with different benchmarks, I am
posting the patchset here to get early feedback.
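To make the intended change in decision logic clearer, here is a minimal
standalone sketch (not the patch code, and not the kernel's actual helpers);
the function names and the threshold values are illustrative only:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Existing heuristic (simplified): a page is a promotion candidate only if
 * the time between the scan-time PTE update and the hint fault is below the
 * hot threshold. With a large footprint the scanner takes long to revisit a
 * page, so this difference is often far above the threshold.
 */
static bool promote_time_based(uint64_t scan_time_ms, uint64_t fault_time_ms,
			       uint64_t threshold_ms)
{
	return (fault_time_ms - scan_time_ms) < threshold_ms;
}

/*
 * Idea in this RFC (simplified): remember the hint fault count at scan time
 * and compare how many hint faults have occurred since, i.e. the "latency"
 * becomes relative to fault activity rather than to wall-clock time.
 */
static bool promote_fault_count_based(uint64_t scan_fault_count,
				      uint64_t fault_time_count,
				      uint64_t threshold_faults)
{
	return (fault_time_count - scan_fault_count) < threshold_faults;
}

int main(void)
{
	/* First sample from the table above: 20620 ms latency vs 1125 ms threshold */
	printf("time based:        %s\n",
	       promote_time_based(0, 20620, 1125) ? "promote" : "reject");

	/* Hypothetical fault counts; the numbers here are illustrative only */
	printf("fault count based: %s\n",
	       promote_fault_count_based(1000, 1500, 2000) ? "promote" : "reject");
	return 0;
}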
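For reference, the access pattern of the microbenchmark described above is
roughly the following. This is a minimal userspace sketch, not the actual
benchmark: the total size, chunk size and iteration count are illustrative,
and placement across the DRAM and CXL nodes is not shown (in the actual runs
it followed from the allocation size relative to node capacities).

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_SIZE	4096UL
#define CHUNK_SIZE	(1UL << 30)		/* 1G chunks */
#define TOTAL_SIZE	(8UL * CHUNK_SIZE)	/* scaled down for the sketch */
#define HOT_ITERS	100000UL		/* accesses per chunk; illustrative */

int main(void)
{
	char *mem = mmap(NULL, TOTAL_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return 1;
	memset(mem, 0, TOTAL_SIZE);		/* fault everything in */

	unsigned long nr_chunks = TOTAL_SIZE / CHUNK_SIZE;
	unsigned long pages_per_chunk = CHUNK_SIZE / PAGE_SIZE;

	for (;;) {	/* runs until killed */
		/* Pick a random chunk and keep it hot for a while */
		char *chunk = mem + (random() % nr_chunks) * CHUNK_SIZE;

		for (unsigned long i = 0; i < HOT_ITERS; i++) {
			/* Touch a random 4K page within the chunk */
			unsigned long page = random() % pages_per_chunk;
			chunk[page * PAGE_SIZE]++;
		}
	}
}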
Microbenchmark
==============
Total allocation is 192G, which initially occupies all of Node 1 (DRAM) and
half of Node 2 (CXL). Chunk size is 1G.

                                Default          Patched

Benchmark score (us)            637,787,351      571,350,410   (-10.41%)
(Lower is better)
numa_pte_updates                 29,834,747       29,275,489
numa_hint_faults                 12,512,736       12,080,772
numa_hint_faults_local                    0                0
numa_pages_migrated               1,804,583        6,709,580
pgpromote_success                 1,804,500        6,709,526
pgpromote_candidate               1,916,720        7,523,345
pgdemote_kswapd                   5,358,119        9,438,006
pgdemote_direct                           0                0

                                Default          Patched

Number of times should_numa_migrate_memory()
was invoked                      12,512,736       12,080,772

Number of times the migration request was
rejected due to hint fault latency being
higher than threshold            10,595,933        4,557,401

Redis-memtier
=============
memtier_benchmark -t 512 -n 25000 --ratio 1:1 -c 20 -x 1 --key-pattern R:R
--hide-histogram --distinct-client-seed -d 20000 --pipeline=1000

                                Default          Patched

Ops/sec                           51,921.16        52,694.55
Hits/sec                          21,908.72        22,235.03
Misses/sec                         4,051.86         4,112.24
Avg. Latency (msec)               867.51710        591.27561   (-31.84%)
p50 Latency (msec)                876.54300        708.60700   (-19.15%)
p99 Latency (msec)               1044.47900       1044.47900
p99.9 Latency (msec)             1048.57500       1048.57500
KB/sec                           937,330.19       951,291.76
numa_pte_updates                 66,628,064       72,125,512
numa_hint_faults                 57,093,369       63,369,538
numa_hint_faults_local                    0                0
numa_pages_migrated                 799,128        3,634,114
pgpromote_success                   798,974        3,633,672
pgpromote_candidate              33,884,196       23,143,552
pgdemote_kswapd                  13,321,784       11,948,894
pgdemote_direct                         257           57,147

Bharata B Rao (2):
  sched/numa: Fault count based NUMA hint fault latency
  mm: Update hint fault count for pages that are skipped during scanning

 include/linux/mm.h       | 23 ++++---------
 include/linux/mm_types.h |  3 ++
 kernel/sched/debug.c     |  2 +-
 kernel/sched/fair.c      | 73 +++++++++++-----------------------------
 kernel/sched/sched.h     |  1 +
 mm/huge_memory.c         | 10 +++---
 mm/memory.c              |  2 ++
 mm/mprotect.c            | 14 ++++----
 8 files changed, 46 insertions(+), 82 deletions(-)

-- 
2.25.1