On 12/18/2024 9:49 AM, David Rientjes wrote:
Hi everybody,
Hello David,
This is an excellent recap. Thank you for the summary, and also for the
opportunity.
Below are some additions on points I had not elaborated enough during the
discussion (and perhaps a few more points along the way).
We had a very interactive discussion last week led by RaghavendraKT on
slow-tier page promotion intended for memory tiering platforms, thank
you! Thanks as well to everybody who attended and provided great
questions, suggestions, and feedback.
The RFC patch series "mm: slowtier page promotion based on PTE A bit"[1]
is a proposal to allow for asynchronous page promotion based on memory
accesses as an alternative to NUMA Balancing based promotions. There was
widespread interest in this topic and the discussion surfaced multiple
use cases and requirements, very focused on CXL use cases.
----->o-----
Raghu noted that the current approach utilizing NUMA Balancing focuses on
scan and *migration* in process context, which often gets observed as
latency spikes. This led to an idea for scanning of the PTE Accessed bit
and promotion to be handled by a kthread instead. In Raghu's proposal,
this is called kmmscand. For every mm on the system, the vmas are
scanned and a migration list is created that feeds into page migration.
To avoid scanning the entire process address space, however, there is a
per-process scan period and scan size. Scanning of the VMAs continues
while still within the scan period; once the scan size is exhausted,
scanning transitions into the migration phase.
At a high level, the scan period and scan size are adjusted based on the
accessed folios that were observed in the last scan.
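As a rough illustration of that structure (a sketch only; the mm list,
slot fields, helpers, and tunable names here are assumptions, not the
RFC's actual code):

	/*
	 * Illustrative kmmscand-style main loop: a scan phase, then a
	 * migration phase, then an adaptive sleep. All helpers assumed.
	 */
	static int kmmscand_thread(void *unused)
	{
		while (!kthread_should_stop()) {
			struct mm_slot *slot;

			/* Phase 1: harvest PTE A bits, bounded by scan size. */
			list_for_each_entry(slot, &kmmscand_mm_list, mm_node)
				scan_mm_vmas(slot->mm);

			/* Phase 2: migrate the folios found to be accessed. */
			migrate_collected_folios();

			/* Re-tune period/size based on what the scan saw. */
			adjust_scan_period_and_size();

			schedule_timeout_interruptible(
				msecs_to_jiffies(kmmscand_scan_period_ms));
		}
		return 0;
	}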
----->o-----
I asked if this was really done single threaded, which was confirmed. If
only a single process has pages on a slow memory tier, for example, then
flexible tuning of the scan period and size ensures we do not scan
needlessly. The scan period can be tuned to be more responsive (down to
400ms in this proposal) depending on how many accesses were seen in the
last scan; similarly, it can become much less responsive (up to 5s) if
memory is not found to be accessed.
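The adjustment could be as simple as a multiplicative backoff clamped to
that range (an illustrative heuristic, not necessarily the RFC's exact
policy):

	#define KMMSCAND_PERIOD_MIN_MS	 400
	#define KMMSCAND_PERIOD_MAX_MS	5000

	static unsigned int kmmscand_scan_period_ms = KMMSCAND_PERIOD_MAX_MS;

	/* Called after each scan with pages scanned vs. found accessed. */
	static void adjust_scan_period(unsigned long accessed,
				       unsigned long scanned)
	{
		if (accessed * 2 > scanned)	/* hot: scan more often */
			kmmscand_scan_period_ms =
				max_t(unsigned int, kmmscand_scan_period_ms / 2,
				      KMMSCAND_PERIOD_MIN_MS);
		else				/* cold: back off */
			kmmscand_scan_period_ms =
				min_t(unsigned int, kmmscand_scan_period_ms * 2,
				      KMMSCAND_PERIOD_MAX_MS);
	}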
I also asked if scanning can be disabled entirely; Raghu clarified that
it cannot be.
We have a sysfs tunable (kmmscand/scan_enabled) to enable or disable all
scanning at the global level, but not at per-process granularity.
Wei Xu asked if the scan period should be interpreted as the minimal
interval between scans because kmmscand is single threaded and there are
many processes. Raghu confirmed this is correct: it is the minimal delay.
Even if the scan period is 400ms, in reality it could be multiple seconds
based on load.
Liam Howlett asked how we could have two scans colliding in a time
segment. Raghu noted that if the last scan completes in less than 400ms,
this delay avoids continuous scanning, which would otherwise increase CPU
overhead. Liam further asked whether processes opt into or out of the
scan; Raghu noted we always scan every process. John Hubbard suggested
that we have per-process control.
+1 for prctl()
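Something like this from userspace, for example (entirely hypothetical;
no such prctl() command exists today):

	#include <stdio.h>
	#include <sys/prctl.h>

	/* Hypothetical command number; not a real prctl() command yet. */
	#define PR_SET_KMMSCAND		77

	int main(void)
	{
		/* Opt the calling process out of kmmscand scanning. */
		if (prctl(PR_SET_KMMSCAND, 0, 0, 0, 0))
			perror("prctl");
		return 0;
	}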
I also want to add that I will get data on the minimum and maximum time
required to finish an entire scan, for both the current micro-benchmark
and a real workload (such as Redis or RocksDB), so that we can check
whether a single kthread meets the scanning deadline.
----->o-----
Zi Yan asked a great question about how this would interact with LRU
information used for page reclaim. The scanning could interfere with
cold page detection because it manipulates the Accessed bits.
Wei noted that the kernel leverages page_young for this, so during
scanning we need to transfer the Accessed bit information into
page_young. This is what idle page tracking currently does to avoid
interfering with anything else that harvests the Accessed bit. The scan
itself only cares about the Accessed bit.
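In code terms, the transfer during a scan would look roughly like this
(simplified from what idle page tracking's page_idle_clear_pte_refs_one()
does; folio lookup, isolation, and locking are elided):

	/* While walking PTEs in the scan: */
	if (ptep_test_and_clear_young(vma, addr, pte)) {
		/* Preserve the access info for reclaim via page_young. */
		folio_set_young(folio);
		/* Record the folio as a promotion candidate. */
		list_add(&folio->lru, &migrate_list);
	}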
Zi asked how this would be handled if processes are allowed to opt out,
in other words, if some processes are propagating their Accessed bits to
page_young and others are not. Wei clarified that for page reclaim, the
Accessed bit and page_young should both be checked and are treated
equally.
Wei noted a subtlety here where MGLRU does not currently check
page_young. Since multiple users of the Accessed bit exist, MGLRU should
likely check page_young as well.
Bharata B Rao noted this is equivalent to how both idle page tracking and
DAMON handle this behavior.
I think not much change is expected here.
----->o-----
John Hubbard suggested that this scanning could very well be
multi-threaded, and there's no explicit reason to avoid that. (I didn't
bring it up at the time, but I think this is required just for NUMA
purposes; otherwise it won't scale well.) Raghu noted we have a global mm
list, but will think about this for future iterations.
Ideally, since the kthread is intended to keep hot data in the top tier,
we could easily have one kthread per top-tier node, affined to the CPUs
spanning that node, plus hotplug callbacks.
Here we had a single CPU-less CXL node, so I went ahead with a single
kthread without hotplug callbacks.
I do agree that eventually we need one kthread per slow-tier node, or one
per node available in the system.
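Roughly along these lines (a sketch only; one scanner per CPU-bearing
node, with hotplug handling elided):

	static int __init kmmscand_start_per_node(void)
	{
		int nid;

		/* One scanner per top-tier (CPU-bearing) node. */
		for_each_node_state(nid, N_CPU) {
			struct task_struct *t;

			t = kthread_create_on_node(kmmscand_thread,
						   (void *)(long)nid, nid,
						   "kmmscand/%d", nid);
			if (IS_ERR(t))
				return PTR_ERR(t);
			/* Affine the scanner to the CPUs of its node. */
			set_cpus_allowed_ptr(t, cpumask_of_node(nid));
			wake_up_process(t);
		}
		return 0;
	}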
----->o-----
Raghu noted the current promotion destination is node 0 by default. Wei
noted we could get some page owner information to determine things like
mempolicies or compute the distance between nodes and, if multiple nodes
have the same distance, choose one of them just as we do for demotions.
Gregory Price noted some downsides to using mempolicies for this based on
per-task, per-vma, and cross socket policies, so using the kernel's
memory tiering policies is probably the best way to go about it.
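A minimal sketch of the distance-based choice mentioned above
(illustrative only; mempolicy handling and tie-breaking among equidistant
nodes are elided):

	/* Pick the closest CPU-bearing (top-tier) node to promote to. */
	static int promotion_target(int src_nid)
	{
		int nid, best = NUMA_NO_NODE, best_dist = INT_MAX;

		for_each_node_state(nid, N_CPU) {
			int dist = node_distance(src_nid, nid);

			if (dist < best_dist) {
				best_dist = dist;
				best = nid;
			}
		}
		return best;
	}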
----->o-----
Wei asked about benchmark results and why migration time was reduced
given the same amount of memory to migrate. Raghu noted the only
difference was the migration path, so things like kswapd or page
allocation did not spend a lot of time trying to reclaim memory for the
migration to succeed. This can happen if migrating to a nearly full
target NUMA node.
Raghu also noted that the migration time is not exactly comparable
between NUMA Balancing and kmmscand. We're also not tracking things like
timestamp and storing state to migrate after multiple accesses. Zi also
noted that migrating memory in batches enables some optimizations,
especially for TLB shootdowns.
+1
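For reference, batched promotion funnels a whole list through one
migrate_pages() call, which lets the core code batch TLB flushes, unlike
one-folio-at-a-time NUMA hint faults. A sketch (gfp flags and error
handling are illustrative):

	struct migration_target_control mtc = {
		.nid = target_nid,
		.gfp_mask = GFP_KERNEL | __GFP_THISNODE,
	};
	unsigned int nr_succeeded;

	/* Migrate the whole collected list in one batched call. */
	migrate_pages(&promote_folios, alloc_migration_target, NULL,
		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_NUMA_MISPLACED,
		      &nr_succeeded);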
----->o-----
Wei noted an important point about separating hot page detection and
promotion, which don't actually need to be coupled. The current proposal
uses page table scanning, while future hotness sources may not leverage
page tables at all. We'd very much like to avoid multiple promotion
solutions for different ways of tracking page hotness.
I strongly supported this because I believe for CXL, at least within the
next three years, that memory hotness will likely not be derived from
page table Accessed bit scanning. Zi Yan agreed.
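One possible shape for such a decoupling (purely illustrative; no such
interface exists today):

	/* A hotness backend: A-bit scanning, hint faults, HW counters... */
	struct hotness_source_ops {
		const char *name;
		/* Collect up to @max hot folios onto @list; returns count. */
		int (*collect_hot)(struct list_head *list, unsigned int max);
	};

The promotion path would then consume hot folios from whichever backends
are registered, without knowing how hotness was derived.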
The promotion path may also want to be much less aggressive than
promoting on first access. Raghu showed many possible improvements,
including handling short-lived processes, more accurate hot page
detection using timestamps, etc.
Some of these TODOs can be implemented in the next version.
----->o-----
I asked about offloading the migration to a data mover, such as the PSP
for AMD, a DMA engine, etc., and whether that should be treated as an
entirely separate topic. Bharata said there was a proof-of-concept
available from AMD that does just that, but the initial results were not
that encouraging.
Zi asked if the DMA engine saturated the link between the slow and fast
tiers. If we want to offload to a copy engine, we need to verify that
the throughput is sufficient, or we may be better off using idle CPUs to
perform the migration for us.
----->o-----
I followed up on a discussion point from early in the talk about whether
this should be virtual address scanning, as in the current approach of
walking mm_struct's, or the alternative approach, which would be physical
address scanning.
Raghu sees physical address scanning as a fully alternative approach,
such as what DAMON uses, based on rmap. The only advantage appears to be
avoiding scanning of top-tier memory completely.
Having clarity here would help; both approaches have their own pros and
cons.
We also need to explore using / reusing DAMON, MGLRU, etc. to the extent
possible, depending on the approach.
----->o-----
Wei noted there are a lot of similarities between the RFC implementation
and the MGLRU page walk functionality, and asked whether it would make
sense to converge the two or make them more generally useful.
+1
SeongJae noted that if DAMON logic were used for the scanning that we
could re-use the existing support for controlling the overhead.
+1
John echoed the idea of leveraging the learnings from MGLRU here, and
additionally of trying to get more use out of MGLRU. Wei noted there are
MGLRU optimizations that we can leverage, such as: when the PMD Accessed
bit is clear, we don't need to iterate over the PTEs beneath it for that
scan.
Agree.
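Concretely, that shortcut is along these lines (simplified; it relies on
non-leaf PMD Accessed bit support, i.e. CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG,
and the real logic lives in MGLRU's walk_pmd_range()):

	/* In the PMD-level walk: */
	if (!pmd_young(*pmd))
		continue;	/* no PTE under this PMD was accessed */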
----->o-----
My takeaways:
- the memory tiering discussion that I led at LSF/MM/BPF this year also
focused on asynchronous memory migration, decoupled from NUMA
Balancing, and I strongly believe this is the right direction
Strongly agree.
- the per-process control seems important and with no obvious downsides
as John noted, so likely better to ensure that some processes can opt
out of scanning with a prctl()
+1
- it likely makes sense for MGLRU to also check page_young as Wei noted
so this deals with the transfer of the Accessed bit to page_young
evenly for all processes, even when opting out
- we likely want to reconsider the single threaded nature of the kthread
even if only for NUMA purposes
- using node 0 for all target migrations is only for illustrative
purposes; this will definitely need to be thought out more, such as
using the kernel's understanding of the memory tiers on the system, as
Gregory pointed out
Agree. I hope with some more brainstorming, we could achieve this.
- we want to ensure that the promotion node is a very reasonable
destination target; it would be unfortunate to rely on NUMA Balancing
to migrate memory yet again after promotion to get the proper
affinity :)
Strongly agree. Promoting to the wrong top-tier node loses the entire
benefit, given the ratio of access latency between a remote node and the
CXL node we currently have.
- promotion on first access will likely need to be reconsidered; it is
not even what NUMA Balancing does. We'll likely need to store some
state to promote memory that is repeatedly being accessed, as opposed
to treating a single access as though the memory must be promoted
Just thinking aloud here: how about using first access as a feeder for an
independent hot-page detection module? The current approach is stateless,
i.e. once we determine that a page was accessed, we add it to the
migration list and forget about it.
Can we have this as a feeder for the normal NUMAB algorithm to detect hot
VMAs?
The reason I took this approach is that timestamp / access-history based
logic would need a hash list, some finite hash bucket size, etc. (a
minimal sketch of that idea follows this list).
The current micro-benchmark, with its 8GB of hot CXL memory, itself
involved 2 million pages in a very short span.
Either way, this needs some more thought.
- there is a definite need to separate hot page detection and the
promotion path since hot pages may be derived from multiple sources,
including hardware assists in the future
- for the hot page tracking itself, a common abstraction to be used that
can effectively describe hotness regardless of the backend it is
deriving its information from would likely be quite useful
+1
- I think virtual memory scanning is likely the only viable approach for
this purpose; we could store state in the underlying struct page,
similar to NUMA Balancing, but all scanning should be driven by
walking the mm_struct's to harvest the Accessed bit
- re-using the MGLRU page walk implementation would likely make the
kmmscand scanning implementation much simpler
Will explore this.
- if there is any general pushback on leveraging a kthread for this,
this would be very good feedback to have early
This was one of the most important pieces of feedback I was looking for:
is there any important reason why we should or should not go with a
kthread?
The only downside I found was the scanning / overall CPU overhead, so I
brought the overhead down significantly from the initial implementation,
based on SJ's feedback (from 434s to 4s in the 8GB case).
This is now comparable, but current NUMAB scanning still has lower
overhead.
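For reference, the access-history sketch mentioned above could look
roughly like this (names, sizes, locking, and aging/eviction of entries
are all elided assumptions):

	#include <linux/hashtable.h>
	#include <linux/slab.h>

	#define HOT_HASH_BITS	10
	#define HOT_THRESHOLD	2	/* accesses before we call it hot */

	struct hot_rec {
		unsigned long pfn;
		unsigned int hits;
		struct hlist_node node;
	};

	static DEFINE_HASHTABLE(hot_hash, HOT_HASH_BITS);

	/* Record one observed access; true once the page looks hot. */
	static bool record_access_and_test_hot(unsigned long pfn)
	{
		struct hot_rec *r;

		hash_for_each_possible(hot_hash, r, node, pfn) {
			if (r->pfn == pfn)
				return ++r->hits >= HOT_THRESHOLD;
		}

		r = kzalloc(sizeof(*r), GFP_KERNEL);
		if (!r)
			return false;
		r->pfn = pfn;
		r->hits = 1;
		hash_add(hot_hash, &r->node, pfn);
		return false;
	}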
We'll be looking to incorporate this discussion in our upstream Memory
Tiering Working Group to accelerate alignment and progress on the
approach.
+1
If you are interested in participating in this series of discussions,
please let me know in email. Everybody is welcome to participate and
we'll have summary email threads such as this one to follow-up on the
mailing lists.
Raghu, do you have plans for your next version of the RFC?
Thanks!
Depending on how much radical change the current implementation requires,
I am hopeful of coming up with the next revision in mid or late January
(considering the year-end holidays).
[1] https://lore.kernel.org/linux-mm/20241201153818.2633616-1-raghavendra.kt@xxxxxxx/T/#t
Thanks and Regards
- Raghu