On Tue, Dec 17, 2024 at 08:19:56PM -0800, David Rientjes wrote:
> ----->o-----
> Raghu noted the current promotion destination is node 0 by default. Wei
> noted we could get some page owner information to determine things like
> mempolicies or compute the distance between nodes and, if multiple nodes
> have the same distance, choose one of them just as we do for demotions.
>
> Gregory Price noted some downsides to using mempolicies for this based on
> per-task, per-vma, and cross socket policies, so using the kernel's
> memory tiering policies is probably the best way to go about it.
>

Slightly elaborating here:

- In an async context, associating a page with a specific task is not
  presently possible (that I know of). The most we know is the last
  accessing CPU - maybe - recorded in the page/folio struct, and right
  now that field is disabled in favor of a timestamp when tiering is
  enabled.

  A process with 2 tasks which have access to the page may not run on
  the same socket, so we run the risk of migrating to a bad target.
  Best effort here would suggest either socket is fine - since they're
  both "fast nodes" - but this requires that we record the last
  accessing CPU for a page at identification time.

- Even if we could associate a page with a particular task, the task
  and/or cgroup are not guaranteed to have a socket affinity. Obviously
  if an affinity exists it can be used, but that doesn't satisfy the
  default behavior. Basically, we shouldn't depend on this.

- Per-vma mempolicies are a potential solution, but they're not very
  common in the wild - software would have to become NUMA-aware and
  mbind() particular memory regions. We shouldn't depend on this
  either.

- This holds for future mechanisms like CHMU, whose access data is even
  more abstract (no concept of an accessing task / cpu / owner at all).

More generally: in an async scanning context it's presently not
possible to identify the optimal promotion node - and it likely is not
possible without userland hints.
So we should probably just leverage static configuration data (HMAT)
and some basic math to put together a promotion target in a similar
way to how we calculate a demotion target.

Long-winded way of saying I don't think an optimal solution is
possible, so let's start with suboptimal and get data.

> ----->o-----
> My takeaways:
>
> - there is a definite need to separate hot page detection and the
>   promotion path since hot pages may be derived from multiple sources,
>   including hardware assists in the future
>
> - for the hot page tracking itself, a common abstraction to be used that
>   can effectively describe hotness regardless of the backend it is
>   deriving its information from would likely be quite useful
>

In a synchronous context (accessing task), something like:

    /* numa_node_id() - node of the CPU we're currently running on */
    target_node = numa_node_id();
    promote_pagevec(vec, target_node, PROMOTE_DEFER);

where the promotion logic then does something like:

    promote_batch(pagevec, target);

In an asynchronous context (scanning task), something like:

    promote_pagevec(vec, NUMA_NO_NODE, PROMOTE_DEFER);

where the promotion logic then does something like:

    for each folio in pagevec:
        target = memory_tiers_promotion_target(folio_nid(folio));
        promote(folio, target);

Plumbing-wise this can be optimized to gather similarly-located pages
into a sub-pagevec and use promote_batch() semantics.

My gut says this is the best we're going to get, since async contexts
can't identify accessor locations easily (especially CHMU).

> - I think virtual memory scanning is likely the only viable approach for

Hard disagree. Virtual memory scanning misses an entire class of
memory: unmapped file cache.

https://lore.kernel.org/linux-mm/20241210213744.2968-1-gourry@xxxxxxxxxx/

> this purpose and we could store state in the underlying struct page,

This is contentious. Look at folio->_last_cpupid for context - we're
already overloading fields in subtle ways to steal a 32-bit area.
> similar to NUMA Balancing, but that all scanning should be driven by
> walking the mm_struct's to harvest the Accessed bit
>

If the goal is to do multi-tenant tiering (i.e. many mm_structs), then
this scales poorly by design.

Elsewhere, folks agreed that CXL memory will have HMU-driven hotness
data as the primary mechanism. This is a physical-memory hotness
tracking mechanism that avoids scanning page tables or page structs
entirely. If we think that's the direction this is going, then we
shouldn't invest a ton of effort into a virtual-memory-driven design
as the primary user. (Sure, support it, but don't dive too much
further in.)

> - if there is any general pushback on leveraging a kthread for this,
>   this would be very good feedback to have early
>

I think having one or more kthreads driven by promotion pressure is a
good idea for the promotion system. I'm not sure how well this will
scale for many-process, high-memory systems - covering 1TB+ at 256MB
per scanning interval means ~4096 intervals for a full pass, which is
very low accuracy. Need more data!

~Gregory