On 12/30/2024 11:06 AM, David Rientjes wrote:
On Thu, 26 Dec 2024, Karim Manaouil wrote:
On Wed, Dec 18, 2024 at 07:56:19PM -0500, Gregory Price wrote:
On Tue, Dec 17, 2024 at 08:19:56PM -0800, David Rientjes wrote:
----->o-----
Raghu noted the current promotion destination is node 0 by default. Wei
noted we could get some page owner information to determine things like
mempolicies or compute the distance between nodes and, if multiple nodes
have the same distance, choose one of them just as we do for demotions.
Gregory Price noted some downsides to using mempolicies for this based on
per-task, per-vma, and cross socket policies, so using the kernel's
memory tiering policies is probably the best way to go about it.
Slightly elaborating here:
- In an async context, associating a page with a specific task is not
  presently possible (that I know of). The most we know is the last
  accessing CPU - maybe - in the page/folio struct. Right now this
  is disabled in favor of a timestamp when tiering is enabled.
- A process with 2 tasks which have access to the page may not run
  on the same socket, so we run the risk of migrating to a bad target.
  Best effort here would suggest either socket is fine - since they're
  both "fast nodes" - but this requires that we record the last
  accessing CPU for a page at identification time.
This can be solved with a two-step migration: first, you promote the
page from CXL to a NUMA node, then you rely on NUMA balancing to
further place the page on the right NUMA node. NUMA hint faults can
still be enabled for pages allocated from regular NUMA nodes, but not
for CXL.
I think it would be a shame to promote to the wrong top-tier NUMA node and
rely on NUMA Balancing to fix it up with yet another migration :/
Agree here. The advantage of promotion is lost, considering the typical
access times we currently see for CXL vs. a regular node.
Since these cpuless memory nodes should have a promotion node associated
with them, which defaults to the latency given to us by the HMAT, can we
make that the default promotion target when memory is accessed? The
"normal mode" for NUMA Balancing could fix this up subsequent to the
promotion, but only if enabled.
Raghu noted in the session that the current patch series only promotes to
node 0 but that choice is only for the RFC. I *assume* that every CXL
memory node will have a standard top-tier node to promote to *or* that we
stash that promotion node information at the time of demotion so memory
comes back to the same node it was demoted from.
Either way, this feels like a solvable problem?
How about sharing the hint between NUMAB mode=1 and the kernel thread?
E.g., NUMAB mode=1 needs help identifying hot VMAs to scan (which is
supplied by the kernel thread), whereas the promotion target is kept at
the VMA level as a hint based on hint faults? (Thinking out loud here.)
Even a top-tier node associated with each CXL node might work, but I
need to think more here.
PS: When I ran my experiment with NUMAB mode=1, the benefit of the
kernel thread was intact.
Thanks and Regards
- Raghu