On Mon Dec 30, 2024 at 12:30 AM EST, David Rientjes wrote:
> On Thu, 19 Dec 2024, Shivank Garg wrote:
>
> > On 12/18/2024 8:20 PM, Zi Yan wrote:
> > > On 17 Dec 2024, at 23:19, David Rientjes wrote:
> > >
> > >> Hi everybody,
> > >>
> > >> We had a very interactive discussion last week led by RaghavendraKT on
> > >> slow-tier page promotion intended for memory tiering platforms, thank
> > >> you!  Thanks as well to everybody who attended and provided great
> > >> questions, suggestions, and feedback.
> > >>
> > >> The RFC patch series "mm: slowtier page promotion based on PTE A bit"[1]
> > >> is a proposal to allow for asynchronous page promotion based on memory
> > >> accesses as an alternative to NUMA Balancing based promotions.  There was
> > >> widespread interest in this topic and the discussion surfaced multiple
> > >> use cases and requirements, very focused on CXL use cases.
> > >>
> > > <snip>
> > >> ----->o-----
> > >> I asked about offloading the migration to a data mover, such as the PSP
> > >> for AMD, DMA engine, etc and whether that should be treated entirely
> > >> separately as a topic.  Bharata said there was a proof-of-concept
> > >> available from AMD that does just that but the initial results were not
> > >> that encouraging.
> > >>
> > >> Zi asked if the DMA engine saturated the link between the slow and fast
> > >> tiers.  If we want to offload to a copy engine, we need to verify that
> > >> the throughput is sufficient or we may be better off using idle cpus to
> > >> perform the migration for us.
> > >
> > > <snip>
> > >>
> > >> - we likely want to reconsider the single threaded nature of the kthread
> > >>   even if only for NUMA purposes
> > >>
> > >
> > > Related to using a DMA engine and/or multiple threads for page migration, I
> > > had a patchset accelerating page migration[1] back in 2019. It showed good
> > > throughput speedup, ~4x using 16 threads to copy multiple 2MB THPs. I think
> > > it is time to revisit the topic.
> > > > > > > > > [1] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@xxxxxxxx/ > > > > Hi All, > > > > I wanted to provide some additional context regarding the AMD DMA offloading > > POC mentioned by Bharata: > > https://lore.kernel.org/linux-mm/20240614221525.19170-1-shivankg@xxxxxxx > > > > While the initial results weren't as encouraging as hoped, I plan to improve this > > in next versions of the patchset. > > > > The core idea in my RFC patchset is restructuring the folio move operation > > to better leverage DMA hardware. Instead of the current folio-by-folio approach: > > > > for_each_folio() { > > copy metadata + content + update PTEs > > } > > > > We batch the operations to minimize overhead: > > > > for_each_folio() { > > copy metadata > > } > > DMA batch copy all content > > for_each_folio() { > > update PTEs > > } > > > > My experiment showed that folio copy can consume up to 26.6% of total migration > > cost when moving data between NUMA nodes. This suggests significant room for > > improvement through DMA offloading, particularly for the larger transfers expected > > in CXL scenarios. > > > > It would be interesting work on combining these approaches for optimized page > > promotion. > > > > This is very exciting, thanks Shivank and Zi! The reason I brought this > topic up during the session on asynchronous page promotion for memory > tiering was because page migration is likely going to become *much* more > popular and will be in the critical path under system-wide memory > pressure. Hardware assist and any software optimizations that can go > along with it would certainly be very interesting to discuss. > > Shivank, do you have an estimated timeline for when that patch series will > be refreshed? Any planned integration with TMPM? > > Zi, are you looking to refresh your series and continue discussing page > migration offload? 
> We could set up another Linux MM Alignment Session topic focused exactly
> on this and get representatives from the vendors involved.

Sure. I am redoing the experiments with multiple threads recently and am
seeing a bigger throughput increase (up to 10x throughput with 32 threads)
on NVIDIA Grace CPUs.

Shivank's approach, using MIGRATE_SYNC_NO_COPY, looks simpler than what I
have done, namely splitting migrate_folio() into two parts[1]. I am planning
to rebuild my multithreaded folio copy patches on top of Shivank's patches
with some modifications. One thing to note is that MIGRATE_SYNC_NO_COPY was
removed by Kefeng (cc'd) recently[2], so I will need to bring it back.

[1] https://github.com/x-y-z/linux-dev/tree/batched_page_migration_copy-v6.12
[2] https://lore.kernel.org/all/20240524052843.182275-6-wangkefeng.wang@xxxxxxxxxx/

--
Best Regards,
Yan, Zi
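As a rough userspace illustration of the multithreaded folio copy discussed in this thread, the sketch below splits one 2MB payload into per-thread chunks copied in parallel. It is an assumption-laden stand-in: the names are invented, and the real patches use kernel worker threads rather than pthreads.

```c
#include <pthread.h>
#include <string.h>

/* Sketch of a multithreaded folio copy: split one large payload into
 * NTHREADS chunks and copy them in parallel. Illustrative only; all
 * names are invented and the kernel patches do not use pthreads. */

enum { NTHREADS = 4 };

struct copy_chunk {
	const char *src;
	char *dst;
	size_t len;
};

static void *copy_worker(void *arg)
{
	struct copy_chunk *c = arg;

	memcpy(c->dst, c->src, c->len);
	return NULL;
}

/* Copy len bytes from src to dst using NTHREADS parallel workers. */
static void parallel_copy(const char *src, char *dst, size_t len)
{
	pthread_t tid[NTHREADS];
	struct copy_chunk chunk[NTHREADS];
	size_t per = len / NTHREADS;
	int i;

	for (i = 0; i < NTHREADS; i++) {
		chunk[i].src = src + i * per;
		chunk[i].dst = dst + i * per;
		/* The last worker also takes any remainder. */
		chunk[i].len = (i == NTHREADS - 1) ? len - i * per : per;
		pthread_create(&tid[i], NULL, copy_worker, &chunk[i]);
	}
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
}
```

Whether this wins depends on memory bandwidth: parallel workers help only while the interconnect between the tiers is not already saturated by a single copy stream, which is exactly the question raised about the DMA engine above.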