On 12/30/2024 11:00 AM, David Rientjes wrote:
> On Thu, 19 Dec 2024, Shivank Garg wrote:
>
>> On 12/18/2024 8:20 PM, Zi Yan wrote:
>>> On 17 Dec 2024, at 23:19, David Rientjes wrote:
>>>
>>>> Hi everybody,
>>>>
>>>> We had a very interactive discussion last week led by RaghavendraKT on
>>>> slow-tier page promotion intended for memory tiering platforms, thank
>>>> you! Thanks as well to everybody who attended and provided great
>>>> questions, suggestions, and feedback.
>>>>
>>>> The RFC patch series "mm: slowtier page promotion based on PTE A bit"[1]
>>>> is a proposal to allow for asynchronous page promotion based on memory
>>>> accesses as an alternative to NUMA Balancing based promotions. There was
>>>> widespread interest in this topic and the discussion surfaced multiple
>>>> use cases and requirements, very focused on CXL use cases.
>>>>
>>> <snip>
>>>> ----->o-----
>>>> I asked about offloading the migration to a data mover, such as the PSP
>>>> for AMD, DMA engine, etc and whether that should be treated entirely
>>>> separately as a topic. Bharata said there was a proof-of-concept
>>>> available from AMD that does just that but the initial results were not
>>>> that encouraging.
>>>>
>>>> Zi asked if the DMA engine saturated the link between the slow and fast
>>>> tiers. If we want to offload to a copy engine, we need to verify that
>>>> the throughput is sufficient or we may be better off using idle CPUs to
>>>> perform the migration for us.
>>>
>>> <snip>
>>>>
>>>> - we likely want to reconsider the single-threaded nature of the kthread
>>>>   even if only for NUMA purposes
>>>>
>>>
>>> Related to using DMA engine and/or multiple threads for page migration, I had
>>> a patchset accelerating page migration[1] back in 2019. It showed good
>>> throughput speedup, ~4x using 16 threads to copy multiple 2MB THP. I think
>>> it is time to revisit the topic.
>>>
>>>
>>> [1] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@xxxxxxxx/
>>
>> Hi All,
>>
>> I wanted to provide some additional context regarding the AMD DMA offloading
>> POC mentioned by Bharata:
>> https://lore.kernel.org/linux-mm/20240614221525.19170-1-shivankg@xxxxxxx
>>
>> While the initial results weren't as encouraging as hoped, I plan to improve this
>> in the next versions of the patchset.
>>
>> The core idea in my RFC patchset is restructuring the folio move operation
>> to better leverage DMA hardware. Instead of the current folio-by-folio approach:
>>
>> for_each_folio() {
>>     copy metadata + content + update PTEs
>> }
>>
>> We batch the operations to minimize overhead:
>>
>> for_each_folio() {
>>     copy metadata
>> }
>> DMA batch copy all content
>> for_each_folio() {
>>     update PTEs
>> }
>>
>> My experiment showed that folio copy can consume up to 26.6% of total migration
>> cost when moving data between NUMA nodes. This suggests significant room for
>> improvement through DMA offloading, particularly for the larger transfers expected
>> in CXL scenarios.
>>
>> It would be interesting to work on combining these approaches for optimized page
>> promotion.
>>
>
> This is very exciting, thanks Shivank and Zi! The reason I brought this
> topic up during the session on asynchronous page promotion for memory
> tiering was because page migration is likely going to become *much* more
> popular and will be in the critical path under system-wide memory
> pressure. Hardware assist and any software optimizations that can go
> along with it would certainly be very interesting to discuss.
>
> Shivank, do you have an estimated timeline for when that patch series will
> be refreshed? Any planned integration with TMPM?

Hi David,

It's definitely interesting for us to get it working with SDXI. I'm going
to try it out.

Thanks,
Shivank

>
> Zi, are you looking to refresh your series and continue discussing page
> migration offload? We could set up another Linux MM Alignment Session
> topic focused exactly on this and get representatives from the vendors
> involved.
>
> Thanks!
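
For anyone following along, the two orderings quoted above (folio-by-folio
versus batched) can be modelled in a few lines of user-space Python. This is
only a toy sketch, not code from the RFC patchset: Folio, CopyEngine,
make_folios and both migrate_* functions are hypothetical names invented for
illustration, and the "copy engine" simply counts how many times it is
invoked, as a stand-in for per-submission overhead.

```python
from dataclasses import dataclass


@dataclass
class Folio:
    """Toy stand-in for a folio: some metadata plus page content."""
    flags: int = 0
    content: bytes = b""
    mapped: bool = False  # stands in for "PTEs point at this folio"


class CopyEngine:
    """Hypothetical data mover that counts how often it is invoked."""

    def __init__(self):
        self.submissions = 0

    def copy(self, pairs):
        # One submission copies the content of every (src, dst) pair given.
        self.submissions += 1
        for src, dst in pairs:
            dst.content = bytes(src.content)


def make_folios(n):
    """Build n mapped source folios with distinct metadata and content."""
    return [Folio(flags=i, content=bytes([i]) * 8, mapped=True) for i in range(n)]


def migrate_folio_by_folio(srcs, dsts, engine):
    # Folio-by-folio: metadata, content, and mapping update per folio.
    for src, dst in zip(srcs, dsts):
        dst.flags = src.flags
        engine.copy([(src, dst)])
        src.mapped, dst.mapped = False, True


def migrate_batched(srcs, dsts, engine):
    # Batched: copy all metadata, issue one content copy, then remap all.
    for src, dst in zip(srcs, dsts):
        dst.flags = src.flags
    engine.copy(list(zip(srcs, dsts)))
    for src, dst in zip(srcs, dsts):
        src.mapped, dst.mapped = False, True
```

Both functions leave the destination folios in the same state. The only
difference in this model is the number of copy submissions: one per folio in
the first case, one per batch in the second. Whether that translates into a
real speedup depends on the hardware and is not something this sketch can show.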