Re: [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon

Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> · Fri, 14 Mar 2025 15:28:50 +0000

On Thu, 6 Mar 2025 11:15:30 +0530
Bharata B Rao <bharata@xxxxxxx> wrote:

> kpromoted is a kernel daemon that accumulates hot page info
> from different sources and tries to promote pages from slow
> tiers to top tiers. One instance of this thread runs on each
> node that has CPUs.
> 

Firstly, nice work. Much easier to discuss things with an
implementation to look at.

I'm looking at this with my hardware hotness unit "hammer" in hand :)

> Subsystems that generate hot page access info can report that
> to kpromoted via this API:
> 
> int kpromoted_record_access(u64 pfn, int nid, int src,
> 			    unsigned long time)

This perhaps works as an interface for aggregating methods
that produce per-access events.  Any hardware counter solution
is going to give you data that is already closer to what you use
for the promotion decision.

We might need to aggregate at different levels.  So access
counting promotes to a hot list and we can inject other events
at that level.  The data I have from the CXL HMU is typically of
the form "after an epoch (period of time), these N pages were
accessed more than M times".  I can sort of map that to the
internal storage you have.
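
To make that mapping a bit more concrete, here is a rough userspace
sketch of folding a batched epoch report into the same per-page counts
that per-access events feed.  Everything in it (hot_table,
record_count(), record_epoch(), the table size) is made up for
illustration and is not the storage or API from this patch.  A page
reported as "more than M" is credited with M + 1, which is only a
lower bound on the real count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_TRACKED	64	/* hypothetical fixed-size table */

struct hot_entry {
	uint64_t pfn;
	unsigned int freq;
	int used;
};

struct hot_table {
	struct hot_entry e[NR_TRACKED];
};

/* Linear lookup; optionally claim the first free slot for a new pfn. */
static struct hot_entry *hot_lookup(struct hot_table *t, uint64_t pfn,
				    int create)
{
	struct hot_entry *free_slot = NULL;
	size_t i;

	for (i = 0; i < NR_TRACKED; i++) {
		if (t->e[i].used && t->e[i].pfn == pfn)
			return &t->e[i];
		if (!t->e[i].used && !free_slot)
			free_slot = &t->e[i];
	}
	if (!create || !free_slot)
		return NULL;
	free_slot->used = 1;
	free_slot->pfn = pfn;
	free_slot->freq = 0;
	return free_slot;
}

/* Common entry point: credit @count accesses to @pfn. */
static int record_count(struct hot_table *t, uint64_t pfn, unsigned int count)
{
	struct hot_entry *e = hot_lookup(t, pfn, 1);

	if (!e)
		return -1;	/* table full */
	e->freq += count;
	return 0;
}

/* Per-access sources: one event, one count. */
static int record_access(struct hot_table *t, uint64_t pfn)
{
	return record_count(t, pfn, 1);
}

/* HMU-style source: @n pfns that each exceeded @threshold this epoch. */
static int record_epoch(struct hot_table *t, const uint64_t *pfns, size_t n,
			unsigned int threshold)
{
	size_t i;

	assert(pfns || !n);
	for (i = 0; i < n; i++)
		if (record_count(t, pfns[i], threshold + 1))
			return -1;
	return 0;
}

static unsigned int hot_freq(struct hot_table *t, uint64_t pfn)
{
	struct hot_entry *e = hot_lookup(t, pfn, 0);

	return e ? e->freq : 0;
}
```

The point is only that both kinds of source can land in one "add count"
path, so the per-access call becomes the count == 1 special case.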

Would be good to evaluate approximate trackers on top of access
counts. I've no idea if sketches or similar would be efficient
enough (they have a bit of a write amplification problem) but
they may give good answers with much lower storage cost at the
risk of occasionally saying something is hot when it's not.
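
For reference, the sort of thing I have in mind is a count-min sketch.
Below is a minimal userspace version, purely as an illustration: the
sizes, the hash and all the cms_* names are my assumptions, not
anything in this series.  Each access writes CMS_ROWS counters (the
write amplification), storage is fixed however many pages are seen,
and the estimate can be too high but, short of counter saturation,
never too low.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CMS_ROWS	4	/* counters written per access */
#define CMS_COLS	1024	/* must be a power of two */

struct cms {
	uint16_t cnt[CMS_ROWS][CMS_COLS];	/* 8KiB in total */
};

/* Arbitrary constants, one per row, so each row hashes differently. */
static const uint64_t cms_seed[CMS_ROWS] = {
	0x9e3779b97f4a7c15ULL, 0xc2b2ae3d27d4eb4fULL,
	0x165667b19e3779f9ULL, 0xd6e8feb86659fd93ULL,
};

/* Cheap illustrative mixer; a real user would pick a proper hash. */
static uint32_t cms_hash(uint64_t pfn, int row)
{
	uint64_t x;

	assert(row >= 0 && row < CMS_ROWS);
	x = (pfn ^ cms_seed[row]) * 0xff51afd7ed558ccdULL;
	x ^= x >> 33;
	return (uint32_t)(x & (CMS_COLS - 1));
}

static void cms_reset(struct cms *s)
{
	memset(s, 0, sizeof(*s));
}

/* Record one access to @pfn and return the updated estimate. */
static unsigned int cms_record(struct cms *s, uint64_t pfn)
{
	unsigned int est = UINT16_MAX;
	int row;

	for (row = 0; row < CMS_ROWS; row++) {
		uint16_t *c = &s->cnt[row][cms_hash(pfn, row)];

		if (*c < UINT16_MAX)	/* saturate rather than wrap */
			(*c)++;
		if (*c < est)
			est = *c;
	}
	return est;
}

/* Estimate = minimum over rows; collisions can only inflate it. */
static unsigned int cms_estimate(const struct cms *s, uint64_t pfn)
{
	unsigned int est = UINT16_MAX;
	int row;

	for (row = 0; row < CMS_ROWS; row++) {
		uint16_t c = s->cnt[row][cms_hash(pfn, row)];

		if (c < est)
			est = c;
	}
	return est;
}
```

A page would be treated as hot once cms_record() returns at least some
threshold, with cms_reset() (or a decay pass) run at each epoch
boundary.  Whether four scattered writes per access are affordable is
exactly the bit I don't know.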

> 
> @pfn: The PFN of the memory accessed
> @nid: The accessing NUMA node ID
> @src: The temperature source (subsystem) that generated the
>       access info
> @time: The access time in jiffies
> 
> Some temperature sources may not provide the nid from which
> the page was accessed. This is true for sources that use
> page table scanning for PTE Accessed bit. Currently the toptier
> node to which such pages should be promoted to is hard coded.

For those cases (CXL HMU included) maybe we need to
consider how to fill in the missing node info with at least a vague
chance of getting a reasonable target for migration.  We can always
fall back to a random top tier node, or the nearest one to where we
are coming from (on the basis that we may have landed in this node
via a fallback list when the top tier was under memory pressure).
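
As a strawman for the "nearest top tier node" fallback, something like
the following userspace sketch.  struct topo, pick_target_node() and
the distance table are all hypothetical stand-ins; in the kernel I'd
expect the lookups to become something like node_is_toptier() and
node_distance(), but I have not tried that against this series.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MAX_NODES	8
#define NODE_NONE	(-1)

/* Hypothetical stand-in for the kernel's view of the topology. */
struct topo {
	int nr_nodes;
	bool toptier[MAX_NODES];
	int distance[MAX_NODES][MAX_NODES];	/* SLIT-style, 10 == local */
};

/*
 * Pick a promotion target for a page that currently lives on @src_nid.
 * @accessing_nid is NODE_NONE when the source could not tell us who
 * touched the page (PTE A-bit scanning, HMU-style counters).
 */
static int pick_target_node(const struct topo *t, int accessing_nid,
			    int src_nid)
{
	int nid, best = NODE_NONE, best_dist = INT_MAX;

	assert(t->nr_nodes >= 0 && t->nr_nodes <= MAX_NODES);

	/* Best case: we know the accessor and it is a top tier node. */
	if (accessing_nid >= 0 && accessing_nid < t->nr_nodes &&
	    t->toptier[accessing_nid])
		return accessing_nid;

	if (src_nid < 0 || src_nid >= t->nr_nodes)
		return NODE_NONE;

	/* Otherwise the top tier node nearest to where the page is now. */
	for (nid = 0; nid < t->nr_nodes; nid++) {
		if (!t->toptier[nid])
			continue;
		if (t->distance[src_nid][nid] < best_dist) {
			best_dist = t->distance[src_nid][nid];
			best = nid;
		}
	}
	return best;	/* NODE_NONE if there is no top tier node at all */
}
```

Ties go to the lowest numbered node here, which is arbitrary; a random
pick among equidistant nodes would spread the load better.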