Re: [RFC PATCH V1 00/13] mm: slowtier page promotion based on PTE A bit

On 3/20/2025 2:21 PM, Raghavendra K T wrote:
On 3/20/2025 4:30 AM, Davidlohr Bueso wrote:
On Wed, 19 Mar 2025, Raghavendra K T wrote:

Introduction:
=============
In the current hot page promotion, all the activities, including
process address space scanning, NUMA hint fault handling, and page
migration, are performed in process context, i.e., the scanning
overhead is borne by the applications.

This is the RFC V1 patch series for (slow tier) CXL page promotion.
The approach in this patchset addresses the issue by adding PTE
Accessed bit scanning.

Scanning is done by a global kernel thread which routinely scans all
the processes' address spaces and checks for accesses by reading the
PTE A bit.
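
For illustration, here is a minimal sketch of the scan step (this is
not the patchset code; kmmscand_pmd_entry() and the hot-list
bookkeeping are hypothetical). The idea is to test-and-clear the
Accessed bit under the PTE lock, so that a bit found set on the next
pass indicates a re-access:

#include <linux/mm.h>
#include <linux/pagewalk.h>

static int kmmscand_pmd_entry(pmd_t *pmd, unsigned long addr,
                              unsigned long end, struct mm_walk *walk)
{
        struct vm_area_struct *vma = walk->vma;
        pte_t *start_pte, *pte;
        spinlock_t *ptl;

        start_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
        if (!pte)
                return 0;

        for (; addr < end; addr += PAGE_SIZE, pte++) {
                pte_t ptent = ptep_get(pte);

                if (!pte_present(ptent))
                        continue;
                /*
                 * Clear the A bit; if it was set, the page was touched
                 * since the previous scan pass. Record it for the
                 * migrator (list bookkeeping omitted for brevity).
                 */
                if (ptep_test_and_clear_young(vma, addr, pte))
                        ; /* add the folio to walk->private's hot list */
        }
        pte_unmap_unlock(start_pte, ptl);
        return 0;
}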

A separate migration thread migrates/promotes the pages to the toptier
node based on a simple heuristic that uses toptier scan/access information
of the mm.
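
On the migrate side, a hedged sketch (again illustrative only; the
isolated hot list, the destination-node argument, and the helper names
are assumptions): folios isolated from the LRU during the scan get
promoted in one batch with migrate_pages():

#include <linux/migrate.h>
#include <linux/gfp.h>

/* Allocate the destination folio on the chosen toptier node. */
static struct folio *alloc_promo_folio(struct folio *src, unsigned long data)
{
        int nid = (int)data;

        return __folio_alloc_node(GFP_HIGHUSER_MOVABLE, folio_order(src), nid);
}

/*
 * Promote a batch of folios (already isolated from the LRU during the
 * scan) to dst_nid. MIGRATE_ASYNC keeps the daemon from stalling on
 * busy folios; failures can simply be retried on a later scan pass.
 */
static void kpromoted_drain(struct list_head *hot_list, int dst_nid)
{
        unsigned int nr_succeeded = 0;

        if (list_empty(hot_list))
                return;

        migrate_pages(hot_list, alloc_promo_folio, NULL,
                      (unsigned long)dst_nid, MIGRATE_ASYNC,
                      MR_NUMA_MISPLACED, &nr_succeeded);
}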

Additionally, based on the feedback for RFC V0 [4], a prctl knob with
a scalar value is provided to control per-task scanning.
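
From userspace, the knob could look something like the following
(hypothetical: the prctl name, its number, and the scalar semantics
below are placeholders for illustration, not the ABI from this series):

#include <stdio.h>
#include <sys/prctl.h>

/* Placeholder name/value; the series defines the real constant. */
#ifndef PR_SET_MEMORY_SCAN
#define PR_SET_MEMORY_SCAN 76
#endif

int main(void)
{
        /* Assume 0 opts the task out; higher values scan more often. */
        if (prctl(PR_SET_MEMORY_SCAN, 2, 0, 0, 0))
                perror("prctl(PR_SET_MEMORY_SCAN)");
        return 0;
}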

Initial results show promising numbers on a microbenchmark. Numbers
with real benchmarks, along with findings and tunings, will follow soon.

Experiment:
============
Abench microbenchmark:
- Allocates 8GB/16GB/32GB/64GB of memory on the CXL node.
- Creates 64 threads; each thread randomly accesses pages at 4K
  granularity.
- Runs 512 iterations with a delay of 1 us between two successive
  iterations.

SUT: AMD EPYC, 512 CPUs, 2 nodes, 256GB memory.

3 runs, command:  abench -m 2 -d 1 -i 512 -s <size>

The benchmark measures how much time it takes to complete the task;
lower is better. The expectation is that CXL node memory is migrated
as fast as possible.

Base case:    6.14-rc6 w/ numab mode = 2 (hot page promotion enabled).
Patched case: 6.14-rc6 w/ numab mode = 1 (NUMA balancing enabled);
we expect the daemon to do the page promotion.

Result:
========
        base NUMAB2              patched NUMAB1
        time in sec (%stdev)     time in sec (%stdev)    %gain
8GB      134.33 ( 0.19 )          120.52 ( 0.21 )        10.28
16GB     292.24 ( 0.60 )          275.97 ( 0.18 )         5.56
32GB     585.06 ( 0.24 )          546.49 ( 0.35 )         6.59
64GB    1278.98 ( 0.27 )         1205.20 ( 2.29 )         5.76

Base case:    6.14-rc6 w/ numab mode = 1 (NUMA balancing enabled).
Patched case: 6.14-rc6 w/ numab mode = 1 (NUMA balancing enabled).
        base NUMAB1              patched NUMAB1
        time in sec (%stdev)     time in sec (%stdev)    %gain
8GB      186.71 ( 0.99 )          120.52 ( 0.21 )        35.45
16GB     376.09 ( 0.46 )          275.97 ( 0.18 )        26.62
32GB     744.37 ( 0.71 )          546.49 ( 0.35 )        26.58
64GB    1534.49 ( 0.09 )         1205.20 ( 2.29 )        21.45

Very promising, but a few things. A fairer comparison would be
vs kpromoted using the PROT_NONE hinting of NUMAB2, essentially
disregarding the asynchronous migration and effectively measuring
synchronous vs asynchronous scanning overhead and the implied
semantics. That is, save the extra kthread and only have a per-NUMA-node
migrator, which is the common denominator for all these sources of
hotness.


Yes, I agree that a fair comparison would be
1) kmmscand generating data on pages to be promoted, working with
kpromoted asynchronously migrating them,
vs
2) NUMAB2 generating data on pages to be migrated, integrated with
kpromoted.

As Bharata already mentioned, we tried integrating kpromoted with the
kmmscand-generated migration list, but kmmscand generates a huge amount
of scanned page data, which needs to be organized better so that
kpromoted can handle the migration effectively.

We have not tried (2) yet; I will get back on the possibility (and
also the numbers when both are ready).


Similarly, while I don't see any users disabling NUMAB1 _and_ enabling
this sort of thing, it would be useful to have data on no numa balancing
at all. If nothing else, that would measure the effects of the dest
node heuristics.

Last time I checked, with the patch, the numbers with NUMAB=0 and
NUMAB=1 did not differ much in the 8GB case because most of the
migration was handled by kmmscand: before NUMAB=1 learns and tries
to migrate, kmmscand would have already migrated the pages.

But a longer-running workload with more memory may make more of a
difference. I will come back with those numbers.

                base NUMAB=2            patched NUMAB=0
                time in sec (%stdev)    time in sec (%stdev)
============================================================
8GB              134.33 ( 0.19)          119.88 ( 0.25)
16GB             292.24 ( 0.60)          325.06 (11.11)
32GB             585.06 ( 0.24)          546.15 ( 0.50)
64GB            1278.98 ( 0.27)         1221.41 ( 1.54)

We can see that the numbers have not changed much between NUMAB=1 and
NUMAB=0 in the patched case.

PS: for 16GB there was a bad case where a rare contention happened on
the lock for the same mm, which we can see from the stdev. This should
be taken care of in the next version.

[...]




