Re: Slow-tier Page Promotion discussion recap and open questions

David Rientjes <rientjes@xxxxxxxxxx> · Sun, 29 Dec 2024 21:30:51 -0800 (PST)

On Thu, 19 Dec 2024, Shivank Garg wrote:

> On 12/18/2024 8:20 PM, Zi Yan wrote:
> > On 17 Dec 2024, at 23:19, David Rientjes wrote:
> > 
> >> Hi everybody,
> >>
> >> We had a very interactive discussion last week led by RaghavendraKT on
> >> slow-tier page promotion intended for memory tiering platforms, thank
> >> you!  Thanks as well to everybody who attended and provided great
> >> questions, suggestions, and feedback.
> >>
> >> The RFC patch series "mm: slowtier page promotion based on PTE A bit"[1]
> >> is a proposal to allow for asynchronous page promotion based on memory
> >> accesses as an alternative to NUMA Balancing based promotions.  There was
> >> widespread interest in this topic and the discussion surfaced multiple
> >> use cases and requirements, very focused on CXL use cases.
> >>
> > <snip>
> >> ----->o-----
> >> I asked about offloading the migration to a data mover, such as the PSP
> >> for AMD, DMA engine, etc., and whether that should be treated entirely
> >> separately as a topic.  Bharata said there was a proof-of-concept
> >> available from AMD that does just that but the initial results were not
> >> that encouraging.
> >>
> >> Zi asked if the DMA engine saturated the link between the slow and fast
> >> tiers.  If we want to offload to a copy engine, we need to verify that
> >> the throughput is sufficient or we may be better off using idle cpus to
> >> perform the migration for us.
> > 
> > <snip>
> >>
> >>  - we likely want to reconsider the single threaded nature of the kthread
> >>    even if only for NUMA purposes
> >>
> > 
> > Related to using DMA engine and/or multi threads for page migration, I had
> > a patchset accelerating page migration[1] back in 2019. It showed good
> > throughput speedup, ~4x using 16 threads to copy multiple 2MB THP. I think
> > it is time to revisit the topic.
> > 
> > 
> > [1] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@xxxxxxxx/
> 
> Hi All,
> 
> I wanted to provide some additional context regarding the AMD DMA offloading
> POC mentioned by Bharata:
> https://lore.kernel.org/linux-mm/20240614221525.19170-1-shivankg@xxxxxxx
> 
> While the initial results weren't as encouraging as hoped, I plan to improve this
> in the next versions of the patchset.
> 
> The core idea in my RFC patchset is restructuring the folio move operation
> to better leverage DMA hardware. Instead of the current folio-by-folio approach:
> 
> for_each_folio() {
>     copy metadata + content + update PTEs
> }
> 
> We batch the operations to minimize overhead:
> 
> for_each_folio() {
>     copy metadata
> }
> DMA batch copy all content
> for_each_folio() {
>     update PTEs
> }
> 
> My experiment showed that folio copy can consume up to 26.6% of total migration
> cost when moving data between NUMA nodes. This suggests significant room for
> improvement through DMA offloading, particularly for the larger transfers expected
> in CXL scenarios.
> 
> It would be interesting to work on combining these approaches for optimized page
> promotion.
> 

This is very exciting, thanks Shivank and Zi!  The reason I brought this 
topic up during the session on asynchronous page promotion for memory 
tiering was because page migration is likely going to become *much* more 
popular and will be in the critical path under system-wide memory 
pressure.  Hardware assist and any software optimizations that can go 
along with it would certainly be very interesting to discuss.
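
For anyone following along, here is a minimal userspace sketch of the
three-phase batched flow quoted above (copy metadata, batch copy all
content, update PTEs).  It is only an illustration and is not taken from
either patch series: struct demo_folio, struct copy_desc,
demo_dma_batch_copy() and demo_migrate_batched() are hypothetical names,
the "DMA engine" is just a memcpy() loop, and a "mapped" flag stands in
for the PTE update.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative size only; real folios may be much larger (e.g. 2MB THP). */
#define FOLIO_SIZE 4096

/* Hypothetical stand-in for a folio: metadata, content, mapping state. */
struct demo_folio {
	unsigned long flags;            /* stands in for folio metadata */
	unsigned char data[FOLIO_SIZE]; /* stands in for folio content */
	int mapped;                     /* 1 while "PTEs" point at this folio */
};

/* Hypothetical descriptor for one queued content copy. */
struct copy_desc {
	void *dst;
	const void *src;
	size_t len;
};

/*
 * Stand-in for a copy engine: handle every queued descriptor in a single
 * submission.  Assumption: a real implementation would hand these to DMA
 * hardware; here it is a plain memcpy() loop.  Returns bytes copied.
 */
size_t demo_dma_batch_copy(const struct copy_desc *descs, size_t n)
{
	size_t bytes = 0;

	for (size_t i = 0; i < n; i++) {
		memcpy(descs[i].dst, descs[i].src, descs[i].len);
		bytes += descs[i].len;
	}
	return bytes;
}

/*
 * Move n folios from src[] to dst[] in three phases instead of doing all
 * of the work folio by folio.  descs[] is caller-provided scratch space
 * with room for n entries.  Returns bytes copied.
 */
size_t demo_migrate_batched(struct demo_folio *dst, struct demo_folio *src,
			    struct copy_desc *descs, size_t n)
{
	size_t bytes;

	/* Phase 1: copy metadata and queue the content copy. */
	for (size_t i = 0; i < n; i++) {
		dst[i].flags = src[i].flags;
		descs[i].dst = dst[i].data;
		descs[i].src = src[i].data;
		descs[i].len = FOLIO_SIZE;
	}

	/* Phase 2: one batched submission for all content. */
	bytes = demo_dma_batch_copy(descs, n);

	/* Phase 3: switch the "mappings" over to the new folios. */
	for (size_t i = 0; i < n; i++) {
		src[i].mapped = 0;
		dst[i].mapped = 1;
	}
	return bytes;
}
```

The point of the split is that the copy engine sees one large submission
rather than one small request per folio, so setup cost is paid once per
batch.  Whether that wins in practice depends on the engine throughput
question Zi raised.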

Shivank, do you have an estimated timeline for when that patch series will 
be refreshed?  Any planned integration with TMPM?

Zi, are you looking to refresh your series and continue discussing page 
migration offload?  We could set up another Linux MM Alignment Session 
topic focused exactly on this and get representatives from the vendors 
involved.

Thanks!