On 12/18/2024 8:20 PM, Zi Yan wrote:
> On 17 Dec 2024, at 23:19, David Rientjes wrote:
>
>> Hi everybody,
>>
>> We had a very interactive discussion last week led by RaghavendraKT on
>> slow-tier page promotion intended for memory tiering platforms, thank
>> you! Thanks as well to everybody who attended and provided great
>> questions, suggestions, and feedback.
>>
>> The RFC patch series "mm: slowtier page promotion based on PTE A bit"[1]
>> is a proposal to allow for asynchronous page promotion based on memory
>> accesses as an alternative to NUMA Balancing based promotions. There was
>> widespread interest in this topic and the discussion surfaced multiple
>> use cases and requirements, very focused on CXL use cases.
>>
> <snip>
>> ----->o-----
>> I asked about offloading the migration to a data mover, such as the PSP
>> for AMD, DMA engine, etc and whether that should be treated entirely
>> separately as a topic. Bharata said there was a proof-of-concept
>> available from AMD that does just that but the initial results were not
>> that encouraging.
>>
>> Zi asked if the DMA engine saturated the link between the slow and fast
>> tiers. If we want to offload to a copy engine, we need to verify that
>> the throughput is sufficient or we may be better off using idle cpus to
>> perform the migration for us.
>
> <snip>
>>
>> - we likely want to reconsider the single threaded nature of the kthread
>>   even if only for NUMA purposes
>>
>
> Related to using DMA engine and/or multi threads for page migration, I had
> a patchset accelerating page migration[1] back in 2019. It showed good
> throughput speedup, ~4x using 16 threads to copy multiple 2MB THP. I think
> it is time to revisit the topic.
>
> [1] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@xxxxxxxx/

Hi All,

I wanted to provide some additional context regarding the AMD DMA
offloading POC mentioned by Bharata:
https://lore.kernel.org/linux-mm/20240614221525.19170-1-shivankg@xxxxxxx

While the initial results weren't as encouraging as hoped, I plan to
improve on them in the next versions of the patchset.

The core idea in my RFC patchset is restructuring the folio move operation
to better leverage DMA hardware. Instead of the current folio-by-folio
approach:

    for_each_folio() {
        copy metadata + content + update PTEs
    }

we batch the operations to minimize per-folio overhead:

    for_each_folio() { copy metadata }
    DMA batch copy all content
    for_each_folio() { update PTEs }

My experiments showed that the folio copy can consume up to 26.6% of the
total migration cost when moving data between NUMA nodes. This suggests
significant room for improvement through DMA offloading, particularly for
the larger transfers expected in CXL scenarios.

It would be interesting to work on combining these approaches for
optimized page promotion.

Best regards,
Shivank Garg
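
P.S. For concreteness, a rough sketch of what the batched move path could
look like is below. The struct and helper names here are placeholders for
illustration only, not existing kernel interfaces:

    /*
     * Illustrative sketch of the three-phase batched move. The move_pair
     * struct and the folio_move_metadata(), dma_copy_folio_batch() and
     * folio_update_ptes() helpers are hypothetical placeholders.
     */
    struct move_pair {
            struct folio *src;
            struct folio *dst;
            struct list_head list;
    };

    static int move_folios_batched(struct list_head *pairs)
    {
            struct move_pair *mp;
            int err;

            /* Phase 1: migrate mapping, flags and refcounts per folio. */
            list_for_each_entry(mp, pairs, list) {
                    err = folio_move_metadata(mp->src, mp->dst);
                    if (err)
                            return err;
            }

            /* Phase 2: one batched DMA submission for all folio contents. */
            err = dma_copy_folio_batch(pairs);
            if (err)
                    return err;

            /* Phase 3: re-establish PTEs pointing at the destination folios. */
            list_for_each_entry(mp, pairs, list)
                    folio_update_ptes(mp->src, mp->dst);

            return 0;
    }

Error handling for a partially completed batch (falling back to CPU copies
or unwinding the metadata step) is glossed over here and is one of the
things the next version needs to get right.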