Hi everybody,

We had a very interactive discussion last week led by RaghavendraKT on slow-tier page promotion intended for memory tiering platforms, thank you! Thanks as well to everybody who attended and provided great questions, suggestions, and feedback.

The RFC patch series "mm: slowtier page promotion based on PTE A bit" [1] proposes asynchronous page promotion based on memory accesses as an alternative to NUMA Balancing based promotions. There was widespread interest in this topic, and the discussion surfaced multiple use cases and requirements, largely centered on CXL.

----->o-----

Raghu noted that the current approach utilizing NUMA Balancing does both the scanning and *migration* in process context, which is often observed as latency spikes. This led to the idea of having the PTE Accessed bit scanning and the promotion handled by a kthread instead; in Raghu's proposal, this is called kmmscand. For every mm on the system, the vmas are scanned and a migration list is created that feeds into page migration. To avoid scanning the entire process address space, however, there is a per-process scan period and scan size: scanning of the vmas continues while still within the scan period, and once the scan size has been covered, scanning transitions into the migration phase. At a high level, the scan period and scan size are adjusted based on the number of accessed folios observed in the last scan.

----->o-----

I asked if this was really done single-threaded, which was confirmed. If only a single process has pages on a slow memory tier, for example, then flexible tuning of the scan period and size ensures we do not scan needlessly. The scan period can be tuned to be more responsive (down to 400ms in this proposal) depending on how many accesses were seen in the last scan; similarly, it can be much less responsive (up to 5s) if memory is not found to be accessed (a sketch of this feedback loop appears further below). I also asked if scanning can be disabled entirely; Raghu clarified that it cannot be.

Wei Xu asked if the scan period should be interpreted as the minimal interval between scans, since kmmscand is single-threaded and there are many processes. Raghu confirmed this is correct: it is the minimal delay. Even if the scan period is 400ms, the actual interval could be multiple seconds depending on load. Liam Howlett asked how two scans could collide within a time segment; Raghu noted that if the last scan completes in less than 400ms, this delay avoids continuous scanning and the increased cpu overhead that would come with it. Liam further asked whether processes opt into or out of the scan; Raghu noted we always scan every process. John Hubbard suggested that we have per-process control.

----->o-----

Zi Yan asked a great question about how this would interact with the LRU information used for page reclaim: the scanning could interfere with cold page detection because it manipulates the Accessed bits. Wei noted that the kernel leverages page_young for this, so during scanning we need to transfer the Accessed bit information into page_young; this is what idle page tracking currently does to avoid interfering with anything else that harvests the Accessed bit. The scan itself only cares about the Accessed bit. Zi asked how this would be handled if processes are allowed to opt out, in other words, if some processes are propagating their Accessed bits to page_young and others are not. Wei clarified that for page reclaim, the Accessed bit and page_young should both be checked and treated equally.

Wei noted a subtlety here: MGLRU does not currently check page_young. Since multiple users of the Accessed bit exist, MGLRU should likely check page_young as well. Bharata B Rao noted this matches how both idle page tracking and DAMON handle this.
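To make the page_young point concrete, here is a minimal sketch of a scan callback, modeled on what idle page tracking does in mm/page_idle.c. The kmmscand_pte_entry() callback name and the candidate-list bookkeeping are hypothetical; the helpers it calls are existing kernel APIs:

#include <linux/mm.h>
#include <linux/pagewalk.h>
#include <linux/page_idle.h>
#include <linux/mmu_notifier.h>

static int kmmscand_pte_entry(pte_t *pte, unsigned long addr,
                              unsigned long next, struct mm_walk *walk)
{
        pte_t ptent = ptep_get(pte);
        struct folio *folio;

        if (!pte_present(ptent))
                return 0;

        folio = vm_normal_folio(walk->vma, addr, ptent);
        if (!folio)
                return 0;

        if (ptep_clear_young_notify(walk->vma, addr, pte)) {
                /*
                 * Reclaim checks both the A bit and page_young, so
                 * recording the harvested access here keeps cold page
                 * detection working even though we just cleared the
                 * A bit.
                 */
                folio_set_young(folio);
                /* ... add (folio, addr) to the promotion candidate list ... */
        }
        return 0;
}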
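And, returning to the adaptive scan period described earlier, a hypothetical sketch of the kind of feedback loop involved. The 400ms floor and 5s ceiling are from the RFC; the structure, names, and thresholds here are purely illustrative:

#include <linux/minmax.h>

#define KMMSCAND_PERIOD_MIN_MS  400UL
#define KMMSCAND_PERIOD_MAX_MS  5000UL

struct kmmscand_mm_state {
        unsigned long scan_period_ms;   /* minimal delay before next scan */
        unsigned long scan_size;        /* amount to cover per scan */
};

static void kmmscand_tune(struct kmmscand_mm_state *state,
                          unsigned long accessed, unsigned long scanned)
{
        if (!scanned)
                return;

        if (accessed * 4 >= scanned)            /* >= 25% recently accessed */
                state->scan_period_ms = max(state->scan_period_ms / 2,
                                            KMMSCAND_PERIOD_MIN_MS);
        else if (accessed * 16 <= scanned)      /* <= ~6% recently accessed */
                state->scan_period_ms = min(state->scan_period_ms * 2,
                                            KMMSCAND_PERIOD_MAX_MS);
}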
----->o-----

John Hubbard suggested that this scanning may very well be multi-threaded; there is no explicit reason to avoid that, and otherwise it won't scale well. (I didn't bring it up at the time, but I think this is required just for NUMA purposes.) Raghu noted there is currently a global mm list, but will think about this for future iterations.

----->o-----

Raghu noted the current promotion destination is node 0 by default. Wei noted we could use some page owner information to determine things like mempolicies, or compute the distance between nodes and, if multiple nodes have the same distance, choose one of them just as we do for demotions (sketched further below). Gregory Price noted some downsides to using mempolicies for this, given per-task, per-vma, and cross-socket policies, so using the kernel's memory tiering policies is probably the best way to go about it.

----->o-----

Wei asked about the benchmark results and why migration time was reduced given the same amount of memory to migrate. Raghu noted the only difference was the migration path, so things like kswapd or the page allocator did not spend a lot of time trying to reclaim memory for the migration to succeed; this can happen when migrating to a nearly full target NUMA node. Raghu also noted that the migration time is not exactly comparable between NUMA Balancing and kmmscand: we are also not tracking things like timestamps, nor storing state to migrate only after multiple accesses. Zi also noted that migrating memory in batches enables some optimizations, especially for TLB shootdowns (also sketched further below).

----->o-----

Wei made an important point about separating hot page detection from promotion; the two don't actually need to be coupled at all. This proposal uses page table scanning, while future support may not need to leverage page tables at all, and we'd very much like to avoid having multiple promotion solutions for different ways of tracking page hotness. I strongly supported this because I believe that for CXL, at least within the next three years, memory hotness will likely not be derived from page table Accessed bit scanning. Zi Yan agreed. The promotion path may also want to be much less aggressive than promoting on first access. Raghu showed many improvements, including handling short-lived processes, more accurate hot page detection using timestamps, etc.

----->o-----

I asked about offloading the migration to a data mover, such as the PSP on AMD, a DMA engine, etc., and whether that should be treated as an entirely separate topic. Bharata said there is a proof-of-concept available from AMD that does just that, but the initial results were not that encouraging. Zi asked if the DMA engine saturated the link between the slow and fast tiers: if we want to offload to a copy engine, we need to verify that its throughput is sufficient, or we may be better off using idle cpus to perform the migration for us.

----->o-----

I followed up on a discussion point from early in the talk about whether this should use virtual address scanning like the current approach, walking the mm_struct's, or the alternative approach of physical address scanning. Raghu sees physical scanning as a fully alternative approach, such as what DAMON uses based on rmap; its only advantage appears to be avoiding scanning of top-tier memory completely.

----->o-----

Wei noted there are a lot of similarities between the RFC implementation and the MGLRU page walk functionality, and asked whether it would make sense to converge the two or make that code more generally useful. SeongJae noted that if DAMON logic were used for the scanning, we could reuse its existing support for controlling the overhead. John echoed the idea of leveraging the learnings from MGLRU here, additionally as a way of getting more use out of MGLRU. Wei noted there are MGLRU optimizations we could leverage, such as not iterating any further in a scan beneath a pmd whose Accessed bit is clear.
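As a sketch of that last optimization, assuming an architecture that sets the Accessed bit in non-leaf entries (which MGLRU gates behind CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG); the callback name is hypothetical:

#include <linux/pagewalk.h>
#include <linux/pgtable.h>

static int kmmscand_pmd_entry(pmd_t *pmd, unsigned long addr,
                              unsigned long next, struct mm_walk *walk)
{
        if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) &&
            !pmd_young(pmdp_get(pmd))) {
                /*
                 * Nothing beneath this pmd was accessed through this
                 * page table, so skip the entire PTE level for this
                 * scan.
                 */
                walk->action = ACTION_CONTINUE;
                return 0;
        }
        return 0;       /* descend to the pte_entry callback as usual */
}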
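On the earlier promotion-destination point, a sketch of what picking a target from the kernel's own tiering knowledge could look like, instead of hardcoding node 0. node_is_toptier() and node_distance() are existing kernel APIs; the helper name and the absence of tie-breaking are illustrative:

#include <linux/memory-tiers.h>
#include <linux/topology.h>
#include <linux/nodemask.h>

static int kmmscand_promotion_node(int src_nid)
{
        int nid, best_nid = NUMA_NO_NODE;
        int best_dist = INT_MAX;

        for_each_online_node(nid) {
                if (!node_is_toptier(nid))
                        continue;
                if (node_distance(src_nid, nid) < best_dist) {
                        best_dist = node_distance(src_nid, nid);
                        best_nid = nid;
                }
        }
        return best_nid;
}

When multiple top-tier nodes are equidistant, ties could be broken by free capacity or round-robin, much as is done when choosing among demotion targets.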
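Finally, on the migration-batching point, a sketch of handing the scanner's whole candidate list to migrate_pages() in one call, so that unmaps and TLB shootdowns are batched across the list rather than paid per page as task-context NUMA hint faults do. migrate_pages() and alloc_migration_target() are existing APIs; the function name here is hypothetical, and struct migration_target_control is currently private to mm/internal.h, so this would have to live in mm/:

#include <linux/migrate.h>
#include "internal.h"   /* struct migration_target_control */

static unsigned int kmmscand_migrate_batch(struct list_head *promote_list,
                                           int target_nid)
{
        struct migration_target_control mtc = {
                .nid = target_nid,
                .gfp_mask = GFP_KERNEL,
        };
        unsigned int nr_succeeded = 0;

        /* One call migrates the whole batch. */
        migrate_pages(promote_list, alloc_migration_target, NULL,
                      (unsigned long)&mtc, MIGRATE_ASYNC,
                      MR_NUMA_MISPLACED, &nr_succeeded);
        return nr_succeeded;
}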
----->o----- Wei noted there was a lot of similarities between the RFC implementation and the MGLRU page walk functionality and whether it would make sense to try to converge these together or make more generally useful. SeongJae noted that if DAMON logic were used for the scanning that we could re-use the existing support for controlling the overhead. John echoed the idea of leveraging the learnings from MGLRU in this, additionally for trying to get more use of MGLRU. Wei noted there are MGLRU optimizations that we can leverage such as when the pmd Accessed bit is clear we don't need to iterate any further for that scan. ----->o----- My takeaways: - the memory tiering discussion that I led at LSF/MM/BPF this year also focused on asynchronous memory migration, decoupled from NUMA Balancing and I strongly believe this is the right direction - the per-process control seems important and with no obvious downsides as John noted, so likely better to ensure that some processes can opt out of scanning with a prctl() - it likely makes sense for MGLRU to also check page_young as Wei noted so this deals with the transfer of the Accessed bit to page_young evenly for all processes, even when opting out - we likely want to reconsider the single threaded nature of the kthread even if only for NUMA purposes - using node 0 for all target migrations is only for illustrative purposes, this will definitely need to be more thought out such as using the kernel's understanding of the memory tiers on the system as Gregory pointed out - we want to ensure that the promotion node is a very reasonable destination target, it would be unfortunate to rely on NUMA Balancing to then migrate memory again once it's promoted to get the proper affinity :) - promotion on first access will likely need to be reconsidered, which is not even used by NUMA Balancing. We'll likely need to store some state to promote memory that is repeatedly being accessed as opposed to treating a single access as though the memory must be promoted - there is a definite need to separate hot page detection and the promotion path since hot pages may be derived from multiple sources, including hardware assists in the future - for the hot page tracking itself, a common abstraction to be used that can effectively describe hotness regardless of the backend it is deriving its information from would likely be quite useful - I think virtual memory scanning is likely the only viable approach for this purpose and we could store state in the underlying struct page, similar to NUMA Balancing, but that all scanning should be driven by walking the mm_struct's to harvest the Accessed bit - re-using the MGLRU page walk implementation would likely make the kmmscand scanning implementation much simpler - if there is any general pushback on leveraging a kthread for this, this would be very good feedback to have early We'll be looking to incorporate this discussion in our upstream Memory Tiering Working Group to accelerate alignment and progress on the approach. If you are interested in participating in this series of discussions, please let me know in email. Everybody is welcome to participate and we'll have summary email threads such as this one to follow-up on the mailing lists. Raghu, do you have plans for your next version of the RFC? Thanks! [1] https://lore.kernel.org/linux-mm/20241201153818.2633616-1-raghavendra.kt@xxxxxxx/T/#t