On Thu, 26 Dec 2024, Karim Manaouil wrote:

> On Wed, Dec 18, 2024 at 07:56:19PM -0500, Gregory Price wrote:
> > On Tue, Dec 17, 2024 at 08:19:56PM -0800, David Rientjes wrote:
> > > ----->o-----
> > > Raghu noted the current promotion destination is node 0 by default.
> > > Wei noted we could get some page owner information to determine
> > > things like mempolicies or compute the distance between nodes and,
> > > if multiple nodes have the same distance, choose one of them just
> > > as we do for demotions.
> > >
> > > Gregory Price noted some downsides to using mempolicies for this
> > > based on per-task, per-vma, and cross-socket policies, so using the
> > > kernel's memory tiering policies is probably the best way to go
> > > about it.
> >
> > Slightly elaborating here:
> >  - In an async context, associating a page with a specific task is
> >    not presently possible (that I know of).  The most we know is the
> >    last accessing CPU - maybe - in the page/folio struct.  Right now
> >    this is disabled in favor of a timestamp when tiering is enabled.
> >
> >    A process with two tasks which have access to the page may not run
> >    on the same socket, so we run the risk of migrating to a bad
> >    target.  Best effort here would suggest either socket is fine -
> >    since they're both "fast nodes" - but this requires that we record
> >    the last accessing CPU for a page at identification time.
>
> This can be solved with a two-step migration: first, you promote the
> page from CXL to a NUMA node, then you rely on NUMA balancing to
> further place the page into the right NUMA node.  NUMA hint faults
> can still be enabled for pages allocated from NUMA nodes, but not
> for CXL.

I think it would be a shame to promote to the wrong top-tier NUMA node
and rely on NUMA Balancing to fix it up with yet another migration :/

Since these cpuless memory nodes should have a promotion node associated
with them, which by default is derived from the latency information the
HMAT gives us, can we make that node the default promotion target when
its memory is accessed?  (A rough sketch of that lookup is at the end of
this mail.)  The "normal mode" of NUMA Balancing could then fix things
up subsequent to the promotion, but only if enabled.

Raghu noted in the session that the current patch series only promotes
to node 0, but that choice is only for the RFC.  I *assume* that every
CXL memory node will have a standard top-tier node to promote to *or*
that we stash that promotion node information at the time of demotion
so memory comes back to the same node it was demoted from (also
sketched below).  Either way, this feels like a solvable problem?
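
To make the HMAT-based default concrete, here is a minimal, untested
sketch.  default_promotion_node() is a made-up name and no such helper
exists in the tree today; node_distance(), node_is_toptier(),
for_each_node_state() and NUMA_NO_NODE are the existing interfaces it
leans on:

#include <linux/limits.h>        /* INT_MAX */
#include <linux/memory-tiers.h>  /* node_is_toptier() */
#include <linux/nodemask.h>      /* for_each_node_state(), N_MEMORY */
#include <linux/numa.h>          /* NUMA_NO_NODE */
#include <linux/topology.h>      /* node_distance() */

/*
 * Made-up helper: pick a default promotion target for a cpuless
 * (e.g. CXL) node by taking the nearest top-tier node according to
 * the firmware-derived distance table.  Ties resolve to the
 * lowest-numbered node here for simplicity; a real version might
 * spread equal-distance candidates around, the way demotion target
 * selection picks among equally preferred nodes.
 */
static int default_promotion_node(int cxl_nid)
{
        int nid, best = NUMA_NO_NODE, best_dist = INT_MAX;

        for_each_node_state(nid, N_MEMORY) {
                int dist = node_distance(cxl_nid, nid);

                if (!node_is_toptier(nid))
                        continue;
                if (dist < best_dist) {
                        best_dist = dist;
                        best = nid;
                }
        }
        return best;
}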
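
And for the stash-at-demotion variant, equally made up and purely
illustrative (it builds on the sketch above; only READ_ONCE() /
WRITE_ONCE(), MAX_NUMNODES and the APIs already named are real):

/*
 * Remember, per lower-tier node, which top-tier node most recently
 * demoted into it, and prefer that node when promoting.  This is
 * deliberately lossy: one CXL node can receive demotions from several
 * sockets, so a real implementation would probably want per-folio
 * state instead; the obvious spot for that, the cpupid bits, is
 * already repurposed for the access timestamp when tiering is on,
 * per Gregory's point above.
 */
static int preferred_promotion_nid[MAX_NUMNODES] = {
        [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE,
};

/* Call from the demotion path when migrating src_nid -> dst_nid. */
static void note_demotion(int src_nid, int dst_nid)
{
        WRITE_ONCE(preferred_promotion_nid[dst_nid], src_nid);
}

/* Promotion target: the stashed node if still valid, else the default. */
static int promotion_target(int cxl_nid)
{
        int nid = READ_ONCE(preferred_promotion_nid[cxl_nid]);

        if (nid != NUMA_NO_NODE && node_is_toptier(nid))
                return nid;
        return default_promotion_node(cxl_nid);
}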