Hello Raghavendra,

Thank you for posting this nice patch series.  I gave you some feedback
offline.  I am adding it here again for transparency, and I am grateful for
this public discussion.

On Sun, 1 Dec 2024 15:38:08 +0000 Raghavendra K T <raghavendra.kt@xxxxxxx> wrote:

> Introduction:
> =============
> This patchset is an outcome of an ongoing collaboration between AMD and Meta.
> Meta wanted to explore an alternative page promotion technique as they
> observe high latency spikes in their workloads that access CXL memory.
>
> In the current hot page promotion, all the activities including the
> process address space scanning, NUMA hint fault handling and page
> migration is performed in the process context. i.e., scanning overhead is
> borne by applications.

Yet another approach is using DAMON.  DAMON does access monitoring, and further
allows users to request access pattern-driven system operations under the name
of DAMOS (Data Access Monitoring-based Operation Schemes).  Using it, users can
ask DAMON to find hot pages and promote them, while finding cold pages and
demoting them.  SK hynix has built their CXL-based memory capacity expansion
solution in this way (https://github.com/skhynix/hmsdk/wiki/Capacity-Expansion).
We collaboratively developed new DAMON features for that, and those are all in
the mainline since Linux v6.11.

I also proposed an idea for advancing it using DAMOS auto-tuning on a more
general (>2 tiers) setup
(https://lore.kernel.org/20231112195602.61525-1-sj@xxxxxxxxxx).  I haven't had
time to further implement and test the idea so far, though.

>
> This is an early RFC patch series to do (slow tier) CXL page promotion.
> The approach in this patchset assists/addresses the issue by adding PTE
> Accessed bit scanning.
>
> Scanning is done by a global kernel thread which routinely scans all
> the processes' address spaces and checks for accesses by reading the
> PTE A bit. It then migrates/promotes the pages to the toptier node
> (node 0 in the current approach).
>
> Thus, the approach pushes overhead of scanning, NUMA hint faults and
> migrations off from process context.

DAMON also uses the PTE A bit as its major source of access information.  And
DAMON does both access monitoring and promotion/demotion in a global kernel
thread, namely kdamond.  Hence the DAMON-based approach would also offload the
overheads from process context.  So I feel your approach has some similarity
with the DAMON-based one, and we might have a chance to avoid unnecessary
duplication.

[...]
>
> Limitations:
> ===========
> PTE A bit scanning approach lacks information about exact destination
> node to migrate to.

This is the same for the DAMON-based approach, since DAMON also uses the PTE A
bit as the major source of the information.  We aim to extend DAMON to be aware
of the access source CPU, and use it for solving this problem, though.
Utilizing page faults or AMD IBS-like h/w features is among the ideas on the
table.

>
> Notes/Observations on design/Implementations/Alternatives/TODOs...
> ================================
> 1. Fine-tuning scan throttling

DAMON allows users to set the upper limit of monitoring overhead, using the
max_nr_regions parameter.  Then it provides its best-effort accuracy.  We also
have ongoing projects for making it more accurate and easier to tune.

>
> 2. Use migrate_balanced_pgdat() to balance toptier node before migration
>    OR Use migrate_misplaced_folio_prepare() directly.
>    But it may need some optimizations (for e.g., invoke occasionaly so
>    that overhead is not there for every migration).
>
> 3. Explore if a separate PAGE_EXT flag is needed instead of reusing
>    PAGE_IDLE flag (cons: complicates PTE A bit handling in the system),
>    But practically does not look good idea.
>
> 4. Use timestamp information-based migration (Similar to numab mode=2).
>    instead of migrating immediately when PTE A bit set.
>    (cons:
>     - It will not be accurate since it is done outside of process
>     context.
>     - Performance benefit may be lost.)

DAMON provides a sort of time-based aggregated monitoring results.  And DAMOS
provides prioritization of pages based on the access temperature.  Hence, the
DAMON-based approach can also be used for a similar purpose (promoting not
every accessed page, but pages that are used more frequently for a longer
time).

>
> 5. Explore if we need to use PFN information + hash list instead of
>    simple migration list. Here scanning is directly done with PFN belonging
>    to CXL node.

DAMON supports physical address space monitoring, and maintains the access
monitoring results in its own data structure called damon_region.  So I think
a similar benefit can be achieved using DAMON?

[...]
> 8. Using DAMON APIs OR Reusing part of DAMON which already tracks range of
>    physical addresses accessed.

My biased humble opinion is that it would be very nice to explore this
opportunity, since I see some similarities, and opportunities to solve some of
the challenges of your approach in an easier way.  Even if it turns out that
DAMON cannot be used for your use case, failing earlier is a good thing, I'd
say :)

>
> 9. Gregory has nicely mentioned some details/ideas on different approaches in
>    [1] : development notes, in the context of promoting unmapped page cache folios.

DAMON supports monitoring accesses to unmapped page cache folios, so hopefully
DAMON-based approaches can also solve this issue.

>
> 10. SJ had pointed about concerns about kernel-thread based approaches as in
>     kstaled [2]. So current patchset has tried to address the issue with simple
>     algorithms to reduce CPU overhead. Migration throttling, Running the daemon
>     in NICE priority, Parallelizing migration with scanning could help further.
>
> 11. Toptier pages scanned can be used to assist current NUMAB by providing information
>     on hot VMAs.
>
> Credits
> =======
> Thanks to Bharata, Joannes, Gregory, SJ, Chris for their valuable comments and
> support.

I also learned many things from the great discussions, thank you :)

[...]
>
> Links:
> [1] https://lore.kernel.org/lkml/20241127082201.1276-1-gourry@xxxxxxxxxx/
> [2] kstaled: https://lore.kernel.org/lkml/1317170947-17074-3-git-send-email-walken@xxxxxxxxxx/#r
> [3] https://lore.kernel.org/lkml/Y+Pj+9bbBbHpf6xM@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
>
> I might have CCed more people or less people than needed
> unintentionally.


Thanks,
SJ

[...]
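
For anyone who wants to try the DAMON-based promotion mentioned in this mail,
here is a rough sketch of what a DAMOS hot-page promotion scheme on the
physical address space might look like through the DAMON sysfs interface.  It
only builds the list of (file, value) writes and touches nothing on the system.
The helper name, the address range, the node id and the access threshold are
all made-up examples, and the file names are my understanding of the v6.11+
sysfs layout, so please check Documentation/admin-guide/mm/damon/usage.rst
before relying on it.

```python
# Sketch only: build the (path, value) pairs one might write to the DAMON
# sysfs interface for a hot-page promotion scheme.  Nothing is written to
# the system.  Addresses, node id and threshold are illustrative assumptions.
from typing import List, Tuple

DAMON_SYSFS = "/sys/kernel/mm/damon/admin/kdamonds"


def promotion_scheme_writes(slow_start: int, slow_end: int,
                            fast_nid: int = 0,
                            min_nr_accesses: int = 5) -> List[Tuple[str, str]]:
    """Return sysfs writes, in order, for one kdamond that monitors the
    physical range [slow_start, slow_end) and migrates hot regions to the
    fast node (hypothetical helper, not part of any DAMON tool)."""
    kd = f"{DAMON_SYSFS}/0"
    ctx = f"{kd}/contexts/0"
    region = f"{ctx}/targets/0/regions/0"
    scheme = f"{ctx}/schemes/0"
    return [
        # Create one kdamond with one monitoring context.
        (f"{DAMON_SYSFS}/nr_kdamonds", "1"),
        (f"{kd}/contexts/nr_contexts", "1"),
        # Monitor the physical address space.
        (f"{ctx}/operations", "paddr"),
        # One target with one region: the slow (e.g. CXL) node's range.
        (f"{ctx}/targets/nr_targets", "1"),
        (f"{ctx}/targets/0/regions/nr_regions", "1"),
        (f"{region}/start", str(slow_start)),
        (f"{region}/end", str(slow_end)),
        # One scheme: migrate regions that look hot to the fast node.
        (f"{ctx}/schemes/nr_schemes", "1"),
        (f"{scheme}/action", "migrate_hot"),
        (f"{scheme}/target_nid", str(fast_nid)),
        (f"{scheme}/access_pattern/nr_accesses/min", str(min_nr_accesses)),
        # Start the kdamond last, once everything else is in place.
        (f"{kd}/state", "on"),
    ]


if __name__ == "__main__":
    # Example range only; real ranges come from the system's memory map.
    for path, value in promotion_scheme_writes(0x100000000, 0x200000000):
        print(f"echo {value} > {path}")
```

A matching demotion scheme would presumably use the migrate_cold action with
the slow node as target_nid, monitoring the fast node's range instead.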