Hi everybody,

We had a very interactive discussion last week led by RaghavendraKT on slow-tier page promotion intended for memory tiering platforms, thank you! Thanks as well to everybody who attended and provided great questions, suggestions, and feedback.

The RFC patch series "mm: slowtier page promotion based on PTE A bit" [1] proposes asynchronous page promotion based on memory accesses as an alternative to NUMA Balancing based promotions. There was widespread interest in this topic, and the discussion surfaced multiple use cases and requirements, largely centered on CXL.

----->o-----

Raghu noted that the current approach utilizing NUMA Balancing does both the scanning and *migration* in process context, which is often observed as latency spikes. This led to the idea of having the PTE Accessed bit scanning and the promotion handled by a kthread instead; in Raghu's proposal, this is called kmmscand. For every mm on the system, the vmas are scanned and a migration list is created that feeds into page migration. To avoid scanning the entire process address space, however, there is a per-process scan period and scan size: scanning of the vmas continues while still within the scan period, and once the scan size has been covered, scanning transitions into the migration phase. At a high level, the scan period and scan size are adjusted based on the number of accessed folios observed in the last scan.

----->o-----

I asked if this was really done single-threaded, which was confirmed. If only a single process has pages on a slow memory tier, for example, then flexible tuning of the scan period and size ensures we do not scan needlessly. The scan period can be tuned to be more responsive (down to 400ms in this proposal) depending on how many accesses were seen in the last scan; similarly, it can be much less responsive (up to 5s) if memory is not found to be accessed (a sketch of this feedback loop appears further below). I also asked if scanning can be disabled entirely; Raghu clarified that it cannot be.

Wei Xu asked if the scan period should be interpreted as the minimal interval between scans, since kmmscand is single-threaded and there are many processes. Raghu confirmed this is correct: it is the minimal delay. Even if the scan period is 400ms, the actual interval could be multiple seconds depending on load. Liam Howlett asked how two scans could collide within a time segment; Raghu noted that if the last scan completes in less than 400ms, this delay avoids continuous scanning and the increased cpu overhead that would come with it. Liam further asked whether processes opt into or out of the scan; Raghu noted we always scan every process. John Hubbard suggested that we have per-process control.

----->o-----

Zi Yan asked a great question about how this would interact with the LRU information used for page reclaim: the scanning could interfere with cold page detection because it manipulates the Accessed bits. Wei noted that the kernel leverages page_young for this, so during scanning we need to transfer the Accessed bit information into page_young; this is what idle page tracking currently does to avoid interfering with anything else that harvests the Accessed bit. The scan itself only cares about the Accessed bit. Zi asked how this would be handled if processes are allowed to opt out, in other words, if some processes are propagating their Accessed bits to page_young and others are not. Wei clarified that for page reclaim, the Accessed bit and page_young should both be checked and treated equally.

Wei noted a subtlety here: MGLRU does not currently check page_young. Since multiple users of the Accessed bit exist, MGLRU should likely check page_young as well. Bharata B Rao noted this matches how both idle page tracking and DAMON handle this.
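To make the page_young point concrete, here is a minimal sketch of a scan callback, modeled on what idle page tracking does in mm/page_idle.c. The kmmscand_pte_entry() callback name and the candidate-list bookkeeping are hypothetical; the helpers it calls are existing kernel APIs:

#include <linux/mm.h>
#include <linux/pagewalk.h>
#include <linux/page_idle.h>
#include <linux/mmu_notifier.h>

static int kmmscand_pte_entry(pte_t *pte, unsigned long addr,
                              unsigned long next, struct mm_walk *walk)
{
        pte_t ptent = ptep_get(pte);
        struct folio *folio;

        if (!pte_present(ptent))
                return 0;

        folio = vm_normal_folio(walk->vma, addr, ptent);
        if (!folio)
                return 0;

        if (ptep_clear_young_notify(walk->vma, addr, pte)) {
                /*
                 * Reclaim checks both the A bit and page_young, so
                 * recording the harvested access here keeps cold page
                 * detection working even though we just cleared the
                 * A bit.
                 */
                folio_set_young(folio);
                /* ... add (folio, addr) to the promotion candidate list ... */
        }
        return 0;
}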
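And, returning to the adaptive scan period described earlier, a hypothetical sketch of the kind of feedback loop involved. The 400ms floor and 5s ceiling are from the RFC; the structure, names, and thresholds here are purely illustrative:

#include <linux/minmax.h>

#define KMMSCAND_PERIOD_MIN_MS  400UL
#define KMMSCAND_PERIOD_MAX_MS  5000UL

struct kmmscand_mm_state {
        unsigned long scan_period_ms;   /* minimal delay before next scan */
        unsigned long scan_size;        /* amount to cover per scan */
};

static void kmmscand_tune(struct kmmscand_mm_state *state,
                          unsigned long accessed, unsigned long scanned)
{
        if (!scanned)
                return;

        if (accessed * 4 >= scanned)            /* >= 25% recently accessed */
                state->scan_period_ms = max(state->scan_period_ms / 2,
                                            KMMSCAND_PERIOD_MIN_MS);
        else if (accessed * 16 <= scanned)      /* <= ~6% recently accessed */
                state->scan_period_ms = min(state->scan_period_ms * 2,
                                            KMMSCAND_PERIOD_MAX_MS);
}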
----->o-----

John Hubbard suggested that this scanning may very well be multi-threaded; there is no explicit reason to avoid that, and otherwise it won't scale well. (I didn't bring it up at the time, but I think this is required just for NUMA purposes.) Raghu noted there is currently a global mm list, but will think about this for future iterations.

----->o-----

Raghu noted the current promotion destination is node 0 by default. Wei noted we could use some page owner information to determine things like mempolicies, or compute the distance between nodes and, if multiple nodes have the same distance, choose one of them just as we do for demotions (sketched further below). Gregory Price noted some downsides to using mempolicies for this, given per-task, per-vma, and cross-socket policies, so using the kernel's memory tiering policies is probably the best way to go about it.

----->o-----

Wei asked about the benchmark results and why migration time was reduced given the same amount of memory to migrate. Raghu noted the only difference was the migration path, so things like kswapd or the page allocator did not spend a lot of time trying to reclaim memory for the migration to succeed; this can happen when migrating to a nearly full target NUMA node. Raghu also noted that the migration time is not exactly comparable between NUMA Balancing and kmmscand: we are also not tracking things like timestamps, nor storing state to migrate only after multiple accesses. Zi also noted that migrating memory in batches enables some optimizations, especially for TLB shootdowns (also sketched further below).

----->o-----

Wei made an important point about separating hot page detection from promotion; the two don't actually need to be coupled at all. This proposal uses page table scanning, while future support may not need to leverage page tables at all, and we'd very much like to avoid having multiple promotion solutions for different ways of tracking page hotness. I strongly supported this because I believe that for CXL, at least within the next three years, memory hotness will likely not be derived from page table Accessed bit scanning. Zi Yan agreed. The promotion path may also want to be much less aggressive than promoting on first access. Raghu showed many improvements, including handling short-lived processes, more accurate hot page detection using timestamps, etc.

----->o-----

I asked about offloading the migration to a data mover, such as the PSP on AMD, a DMA engine, etc., and whether that should be treated as an entirely separate topic. Bharata said there is a proof-of-concept available from AMD that does just that, but the initial results were not that encouraging. Zi asked if the DMA engine saturated the link between the slow and fast tiers: if we want to offload to a copy engine, we need to verify that its throughput is sufficient, or we may be better off using idle cpus to perform the migration for us.

----->o-----

I followed up on a discussion point from early in the talk about whether this should use virtual address scanning like the current approach, walking the mm_struct's, or the alternative approach of physical address scanning. Raghu sees physical scanning as a fully alternative approach, such as what DAMON uses based on rmap; its only advantage appears to be avoiding scanning of top-tier memory completely.

----->o-----

Wei noted there are a lot of similarities between the RFC implementation and the MGLRU page walk functionality, and asked whether it would make sense to converge the two or make that code more generally useful. SeongJae noted that if DAMON logic were used for the scanning, we could reuse its existing support for controlling the overhead. John echoed the idea of leveraging the learnings from MGLRU here, additionally as a way of getting more use out of MGLRU. Wei noted there are MGLRU optimizations we could leverage, such as not iterating any further in a scan beneath a pmd whose Accessed bit is clear.
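As a sketch of that last optimization, assuming an architecture that sets the Accessed bit in non-leaf entries (which MGLRU gates behind CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG); the callback name is hypothetical:

#include <linux/pagewalk.h>
#include <linux/pgtable.h>

static int kmmscand_pmd_entry(pmd_t *pmd, unsigned long addr,
                              unsigned long next, struct mm_walk *walk)
{
        if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) &&
            !pmd_young(pmdp_get(pmd))) {
                /*
                 * Nothing beneath this pmd was accessed through this
                 * page table, so skip the entire PTE level for this
                 * scan.
                 */
                walk->action = ACTION_CONTINUE;
                return 0;
        }
        return 0;       /* descend to the pte_entry callback as usual */
}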
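On the earlier promotion-destination point, a sketch of what picking a target from the kernel's own tiering knowledge could look like, instead of hardcoding node 0. node_is_toptier() and node_distance() are existing kernel APIs; the helper name and the absence of tie-breaking are illustrative:

#include <linux/memory-tiers.h>
#include <linux/topology.h>
#include <linux/nodemask.h>

static int kmmscand_promotion_node(int src_nid)
{
        int nid, best_nid = NUMA_NO_NODE;
        int best_dist = INT_MAX;

        for_each_online_node(nid) {
                if (!node_is_toptier(nid))
                        continue;
                if (node_distance(src_nid, nid) < best_dist) {
                        best_dist = node_distance(src_nid, nid);
                        best_nid = nid;
                }
        }
        return best_nid;
}

When multiple top-tier nodes are equidistant, ties could be broken by free capacity or round-robin, much as is done when choosing among demotion targets.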
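Finally, on the migration-batching point, a sketch of handing the scanner's whole candidate list to migrate_pages() in one call, so that unmaps and TLB shootdowns are batched across the list rather than paid per page as task-context NUMA hint faults do. migrate_pages() and alloc_migration_target() are existing APIs; the function name here is hypothetical, and struct migration_target_control is currently private to mm/internal.h, so this would have to live in mm/:

#include <linux/migrate.h>
#include "internal.h"   /* struct migration_target_control */

static unsigned int kmmscand_migrate_batch(struct list_head *promote_list,
                                           int target_nid)
{
        struct migration_target_control mtc = {
                .nid = target_nid,
                .gfp_mask = GFP_KERNEL,
        };
        unsigned int nr_succeeded = 0;

        /* One call migrates the whole batch. */
        migrate_pages(promote_list, alloc_migration_target, NULL,
                      (unsigned long)&mtc, MIGRATE_ASYNC,
                      MR_NUMA_MISPLACED, &nr_succeeded);
        return nr_succeeded;
}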
----->o----- Wei noted there was a lot of similarities between the RFC implementation and the MGLRU page walk functionality and whether it would make sense to try to converge these together or make more generally useful. SeongJae noted that if DAMON logic were used for the scanning that we could re-use the existing support for controlling the overhead. John echoed the idea of leveraging the learnings from MGLRU in this, additionally for trying to get more use of MGLRU. Wei noted there are MGLRU optimizations that we can leverage such as when the pmd Accessed bit is clear we don't need to iterate any further for that scan. ----->o----- My takeaways: - the memory tiering discussion that I led at LSF/MM/BPF this year also focused on asynchronous memory migration, decoupled from NUMA Balancing and I strongly believe this is the right direction - the per-process control seems important and with no obvious downsides as John noted, so likely better to ensure that some processes can opt out of scanning with a prctl() - it likely makes sense for MGLRU to also check page_young as Wei noted so this deals with the transfer of the Accessed bit to page_young evenly for all processes, even when opting out - we likely want to reconsider the single threaded nature of the kthread even if only for NUMA purposes - using node 0 for all target migrations is only for illustrative purposes, this will definitely need to be more thought out such as using the kernel's understanding of the memory tiers on the system as Gregory pointed out - we want to ensure that the promotion node is a very reasonable destination target, it would be unfortunate to rely on NUMA Balancing to then migrate memory again once it's promoted to get the proper affinity :) - promotion on first access will likely need to be reconsidered, which is not even used by NUMA Balancing. We'll likely need to store some state to promote memory that is repeatedly being accessed as opposed to treating a single access as though the memory must be promoted - there is a definite need to separate hot page detection and the promotion path since hot pages may be derived from multiple sources, including hardware assists in the future - for the hot page tracking itself, a common abstraction to be used that can effectively describe hotness regardless of the backend it is deriving its information from would likely be quite useful - I think virtual memory scanning is likely the only viable approach for this purpose and we could store state in the underlying struct page, similar to NUMA Balancing, but that all scanning should be driven by walking the mm_struct's to harvest the Accessed bit - re-using the MGLRU page walk implementation would likely make the kmmscand scanning implementation much simpler - if there is any general pushback on leveraging a kthread for this, this would be very good feedback to have early We'll be looking to incorporate this discussion in our upstream Memory Tiering Working Group to accelerate alignment and progress on the approach. If you are interested in participating in this series of discussions, please let me know in email. Everybody is welcome to participate and we'll have summary email threads such as this one to follow-up on the mailing lists. Raghu, do you have plans for your next version of the RFC? Thanks! [1] https://lore.kernel.org/linux-mm/20241201153818.2633616-1-raghavendra.kt@xxxxxxx/T/#t