Re: [PATCH v2 0/1] mm: introduce MADV_DEMOTE/MADV_PROMOTE

David Hildenbrand <david@xxxxxxxxxx> · Thu, 1 Aug 2024 14:53:15 +0200

On 01.08.24 11:57, Zhangrenze wrote:
Sure, here's the Scalable Tiered Memory Control (STMC)

**Background**

In the era when artificial intelligence, big data analytics, and
machine learning have become mainstream research topics and
application scenarios, the demand for high-capacity and high-
bandwidth memory in computers has become increasingly important.
The emergence of CXL (Compute Express Link) provides the
possibility of high-capacity memory. Although CXL TYPE3 devices
can provide large memory capacities, their access speed is lower
than traditional DRAM due to hardware architecture limitations.

To enjoy the large capacity brought by CXL memory while minimizing
the impact of high latency, Linux has introduced the Tiered Memory
architecture. In the Tiered Memory architecture, CXL memory is
treated as an independent, slower NUMA NODE, while DRAM is
considered as a relatively faster NUMA NODE. Applications allocate
memory from the local node, and Tiered Memory, leveraging memory
reclamation and NUMA Balancing mechanisms, can transparently demote
physical pages not recently accessed by user processes to the slower
CXL NUMA NODE. However, when user processes re-access the demoted
memory, the Tiered Memory mechanism will, based on certain logic,
decide whether to promote the demoted physical pages back to the
fast NUMA NODE. If the promotion is successful, the memory accessed
by the user process will reside in DRAM; otherwise, it will reside in
the CXL NODE. Through the Tiered Memory mechanism, Linux balances
between large memory capacity and latency, striving to maintain an
equilibrium for applications.

**Problem**
Although Tiered Memory strives to balance between large capacity and
latency, specific scenarios can lead to the following issues:

    1. In scenarios requiring massive computations, if data is heavily
       stored in CXL slow memory and Tiered Memory cannot promptly
       promote this memory to fast DRAM, it will significantly impact
       program performance.
    2. Similar to the scenario described in point 1, Tiered Memory may
       decide to promote these physical pages to the fast DRAM NODE but
       be unable to do so because of the DRAM NODE's promotion ratio
       limit. Consequently, the program will keep running in slow
       memory.
    3. After an application finishes computing on a large block of fast
       memory, it may not immediately re-access it. Hence, this memory
       can only wait for the memory reclamation mechanism to demote it.
    4. Similar to the scenario described in point 3, if the demotion
       speed is slow, these cold pages will occupy the promotion
       resources, preventing some eligible slow pages from being
       immediately promoted, severely affecting application efficiency.

**Solution**
We propose the **Scalable Tiered Memory Control (STMC)** mechanism,
which delegates the authority of promoting and demoting memory to the
application. The principle is simple, as follows:

    1. When an application is preparing for computation, it can promote
       the memory it needs to use or ensure the memory resides on a fast
       NODE.
    2. When an application will not use the memory shortly, it can
       immediately demote the memory to slow memory, freeing up valuable
       promotion resources.

The STMC mechanism is implemented through the madvise system call, which
gains two new advice values: MADV_DEMOTE and MADV_PROMOTE. MADV_DEMOTE
advises demoting the physical memory to the node where slow memory
resides; this advice only fails if there is no free physical memory on
the slow memory node. MADV_PROMOTE advises retaining the physical memory
in the fast memory; this advice only fails if there are no promotion
slots available on the fast memory node. Benefits brought by STMC
include:

    1. The STMC mechanism is a variant of on-demand memory management
       designed to let applications enjoy fast memory as much as possible,
       while actively demoting to slow memory when not in use, thus
       freeing up promotion slots for the NODE and allowing it to run in
       an optimized Tiered Memory environment.
    2. The STMC mechanism better balances large capacity and latency.

**Shortcomings of STMC**
The STMC mechanism requires the caller to manage memory demotion and
promotion. If the memory is not promptly demoted after a promotion,
it may cause issues similar to memory leaks.

Ehm, that sounds scary. Can you elaborate what's happening here and why
it is "similar to memory leaks"?

Can you also point out why migrate_pages() is not suitable? I would
assume demote/promote is in essence simply migrating memory between nodes.

--
Cheers,

David / dhildenb

Thank you for the response. Below are my points of view. If there are any
mistakes, I appreciate your understanding:

1. In a tiered memory system, fast nodes and slow nodes act as two common
    memory pools. The system has a certain ratio limit for promotion. For
    example, a NODE may stipulate that when the available memory is less
    than 1GB or 1/4 of the node's memory, promotion is prohibited. If we
    use migrate_pages at this point, it will promote slow pages to fast
    memory without any such restriction, which may prevent other
    processes' pages that should have been promoted from being promoted.
    This is what I mean by occupying promotion resources.
2. As described in point 1, if we use MADV_PROMOTE to temporarily promote
    a batch of pages and do not demote them immediately after usage, it
    will occupy many promotion resources. Other hot pages that need
    promotion will not be able to get promoted, which will impact the
    performance of certain processes.

So, you mean, applications can actively consume "fast memory" and 
"steal" it from other applications? I assume that's what you meant with 
"memory leak".

I would really suggest to *not* call this "similar to memory leaks", in 
your own favor ;)

3. MADV_DEMOTE and MADV_PROMOTE only rely on madvise, while migrate_pages
    depends on libnuma.

Well, you can trivially call that system call also without libnuma ;) So 
that shouldn't really make a difference and is rather something that can 
be solved in user space.
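
As an illustration of that point, a minimal sketch of invoking
migrate_pages(2) through syscall(2) with no libnuma dependency. The
helper names node_mask() and migrate_between_nodes() are made up for this
example, and it assumes node numbers small enough to fit in a single
unsigned long. Note that migrate_pages(2) acts on all of a process's
pages on the source nodes, not on an address range.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define BITS_PER_ULONG (8 * sizeof(unsigned long))

/* One-word node mask with only `node` set (assumes a small node id). */
static inline unsigned long node_mask(int node)
{
    assert(node >= 0 && (size_t)node < BITS_PER_ULONG);
    return 1UL << node;
}

/*
 * Move the pages of `pid` (0 means the caller) that sit on from_node
 * over to to_node. Returns the number of pages that could not be
 * moved, or -errno on failure.
 */
static inline long migrate_between_nodes(pid_t pid, int from_node, int to_node)
{
    unsigned long old_nodes = node_mask(from_node);
    unsigned long new_nodes = node_mask(to_node);
    /* The kernel reads maxnode - 1 bits, so pass one more than we have. */
    unsigned long maxnode = BITS_PER_ULONG + 1;
    long rc = syscall(SYS_migrate_pages, pid, maxnode, &old_nodes, &new_nodes);

    return rc < 0 ? -(long)errno : rc;
}
```

With DRAM on node 0 and CXL memory on node 1 (an assumed topology),
migrate_between_nodes(0, 0, 1) would play the role of a demotion and
migrate_between_nodes(0, 1, 0) that of a promotion. On a machine without
a second memory node the call is expected to fail with an error rather
than move anything.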

4. MADV_DEMOTE and MADV_PROMOTE provide a better balance between capacity
    and latency. They allow hot pages that need promoting to be promoted
    smoothly and pages that need demoting to be demoted immediately. This
    helps tiered memory systems to operate more rationally.

Can you summarize why something similar could not be provided by a 
library that builds on existing functionality, such as migrate_pages? 
It could easily take a look at memory stats to reason whether a 
promotion/demotion makes sense (your example above with the memory 
distribution).
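
A rough sketch of what such a user-space check could look like, reading
per-node statistics from sysfs and applying the example threshold quoted
above (no promotion when free memory is below 1GB or below 1/4 of the
node). The threshold is only the illustrative number from this thread,
not an actual kernel policy, and promotion_allowed() and
read_node_meminfo() are hypothetical helper names.

```c
#include <stdbool.h>
#include <stdio.h>

#define KB_PER_GB (1024UL * 1024UL)

/*
 * Example policy from the thread: refuse promotion when the fast node
 * has less than 1GB free or less than 1/4 of its total memory free.
 */
static inline bool promotion_allowed(unsigned long free_kb, unsigned long total_kb)
{
    if (free_kb < KB_PER_GB)
        return false;
    if (free_kb < total_kb / 4)
        return false;
    return true;
}

/*
 * Read MemTotal and MemFree (in kB) for one NUMA node from sysfs.
 * Returns 0 on success, -1 if the file is missing or incomplete.
 */
static inline int read_node_meminfo(int node, unsigned long *total_kb,
                                    unsigned long *free_kb)
{
    char path[64], line[256];
    unsigned long val;
    int n, found = 0;
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/node/node%d/meminfo", node);
    f = fopen(path, "r");
    if (!f)
        return -1;

    /* Lines look like "Node 0 MemTotal:       16314440 kB". */
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "Node %d MemTotal: %lu kB", &n, &val) == 2) {
            *total_kb = val;
            found |= 1;
        } else if (sscanf(line, "Node %d MemFree: %lu kB", &n, &val) == 2) {
            *free_kb = val;
            found |= 2;
        }
    }
    fclose(f);
    return found == 3 ? 0 : -1;
}
```

A library could call read_node_meminfo() for the fast node and only
issue the migration when promotion_allowed() returns true, which would
cover the "unrestricted promotion" concern from point 1 without any new
kernel interface.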

From the patch itself I read

"MADV_DEMOTE can mark a range of memory pages as cold
pages and immediately demote them to slow memory. MADV_PROMOTE can mark
a range of memory pages as hot pages and immediately promote them to
fast memory"

which sounds to me like migrate_pages / MADV_COLD might be able to 
achieve something similar.

What's the biggest difference that MADV_DEMOTE|MADV_PROMOTE can do better?

--
Cheers,

David / dhildenb