Re: [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted?

Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> · Thu, 6 Feb 2025 15:30:06 +0000

On Wed, 5 Feb 2025 11:05:29 -0500
Johannes Weiner <hannes@xxxxxxxxxxx> wrote:

> On Wed, Feb 05, 2025 at 11:54:05AM +0530, Bharata B Rao wrote:
> > On 31-Jan-25 6:39 PM, Jonathan Cameron wrote:  
> > > On Fri, 31 Jan 2025 12:28:03 +0000
> > > Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> wrote:
> > >   
> > >>> Here is the list of potential discussion points:  
> > >> ...
> > >>  
> > >>> 2. Possibility of maintaining single source of truth for page hotness that would
> > >>> maintain hot page information from multiple sources and let other sub-systems
> > >>> use that info.  
> > >> Hi,
> > >>
> > >> I was thinking of proposing a separate topic on a single source of hotness,
> > >> but this question covers it so I'll add some thoughts here instead.
> > >> I think we are very early, but sharing some experience and thoughts in a
> > >> session may be useful.  
> > > 
> > > Thinking more on this over lunch, I think it is worth calling this out as a
> > > potential session topic in its own right rather than trying to find
> > > time within other sessions.  Hence the title change.
> > > 
> > > I think a session would start with a brief listing of the temperature sources
> > > we have and those on the horizon to motivate what we are unifying, then
> > > discussion to focus on need for such a unification + requirements
> > > (maybe with a straw man).  
> > 
> > Here is a compilation of available temperature sources and how the 
> > hot/access data is consumed by different subsystems:  
> 
> This is super useful, thanks for collecting this.

Absolutely agree!

> 
> > PA-Physical address available
> > VA-Virtual address available
> > AA-Access time available
> > NA-accessing Node info available
> > 
> > I have left the slot blank for those which I am not sure about.
> > ==================================================
> > Temperature		PA	VA	AA	NA
> > source
> > ==================================================
> > PROT_NONE faults	Y	Y	Y	Y
> > --------------------------------------------------
> > folio_mark_accessed()	Y		Y	Y
> > --------------------------------------------------  
> 
> For fma(), the VA info is available in unmap, but usually it isn't -
> or doesn't meaningfully exist, as in the case of unmapped buffered IO.
> 
> I'd say it's an N.
> 
> > PTE A bit		Y	Y	N	N
> > --------------------------------------------------
> > Platform hints		Y	Y	Y	Y
> > (AMD IBS)
> > --------------------------------------------------
> > Device hints		Y
> > (CXL HMU)
> > ==================================================  

For the use cases where we have relatively few 'pages', the cost of a reverse
map lookup doesn't look to be a problem.  The trick is to do it
only after we've done what we can in PA space to cut down on the
pages of interest.  So maybe (Y) for VA on the device hints row, to reflect
that it is indirect.
Whether it makes sense to do that before or after some common
layer is an interesting question.  That PA/VA mapping might be
out of date anyway by the time we see the data.
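
To make the ordering concrete, here is a rough userspace sketch of
"filter in PA space first, reverse map only the survivors".  Every name
in it (pa_sample, rmap_lookup_fn, toy_rmap_lookup, resolve_hot_pages)
is made up for illustration and is not an existing kernel interface.
The toy lookup fails for odd PFNs purely to model a mapping that has
gone stale by the time we see the data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record from a PA-only source (e.g. a device hint unit). */
struct pa_sample {
	uint64_t pfn;      /* physical frame number */
	uint32_t accesses; /* accesses seen in the sampling window */
};

/* Hypothetical stand-in for a reverse map walk: PFN -> VA, 0 on success. */
typedef int (*rmap_lookup_fn)(uint64_t pfn, uint64_t *va);

/* Toy lookup for the sketch: odd PFNs pretend the mapping has gone away. */
int toy_rmap_lookup(uint64_t pfn, uint64_t *va)
{
	if (pfn & 1)
		return -1;
	*va = pfn << 12; /* assumes 4KiB pages, illustration only */
	return 0;
}

/*
 * Drop samples below 'threshold' while still in PA space, then pay for a
 * reverse map lookup only on what is left.  Returns the number of VAs
 * written to 'vas_out'; '*lookups' counts how many lookups were issued.
 */
size_t resolve_hot_pages(const struct pa_sample *samples, size_t n,
			 uint32_t threshold, rmap_lookup_fn lookup,
			 uint64_t *vas_out, size_t max_out, size_t *lookups)
{
	size_t resolved = 0;

	assert(lookup && vas_out && lookups);
	assert(samples || n == 0);

	*lookups = 0;
	for (size_t i = 0; i < n && resolved < max_out; i++) {
		uint64_t va;

		if (samples[i].accesses < threshold)
			continue; /* filtered in PA space, no rmap cost */

		(*lookups)++;
		if (lookup(samples[i].pfn, &va) != 0)
			continue; /* mapping stale or gone */

		vas_out[resolved++] = va;
	}
	return resolved;
}
```

Whether the threshold step lives in the source driver or in a common
layer is exactly the open question above; the sketch only shows that
the lookup count scales with the filtered set, not the raw sample set.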

> 
> For the following table, it might be useful to add *when* the source
> produces this information. Sampling frequency is a likely challenge:
> consumers have different requirements, and overhead should be limited
> to the minimum required to serve enabled consumers.
> 
> Here is an (incomplete) attempt - sorry about the long lines:
> 
> > And here is an attempt to compile how different subsystems
> > use the above data:
> > ==============================================================
> > Source			Subsystem		Consumption         Activation/Frequency
> > ==============================================================
> > PROT_NONE faults	NUMAB		NUMAB=1 locality based              While task is running,
> > via process pgtable			balancing                           rate varies on observed
> > walk					NUMAB=2 hot page                    locality and sysctl knobs.
> > 					promotion
> > ==============================================================
> > folio_mark_accessed()	FS/filemap/GUP	LRU list activation                 On cache access and unmap
> > ==============================================================
> > PTE A bit via		Reclaim:LRU	LRU list activation,	            During memory pressure
> > rmap walk				deactivation/demotion
> > ==============================================================
> > PTE A bit via		Reclaim:MGLRU	LRU list activation,	            - During memory pressure
> > rmap walk and process			deactivation/demotion               - Continuous sampling (configurable)
> > pgtable walk                                                                for workingset reporting
> > ==============================================================
> > PTE A bit via		DAMON		LRU activation,                     Continuous sampling (configurable)?
> > rmap walk				hot page promotion,                 (I believe SJ is looking into
> > 					demotion etc                         auto-tuning this).
> > ==============================================================
> > Platform hints		NUMAB		NUMAB=1 Locality based
> > (AMD IBS)				balancing and
> > 					NUMAB=2 hot page
> > 					promotion
> > ==============================================================
Based on the CXL one...

> > Device hints		NUMAB		NUMAB=2 hot page       Continuous sampling, frequency controllable.
> > 					promotion                      Subsampling programmable.
> > ==============================================================
> > The last two are listed as possibilities.
> > 
> > Feel free to correct/clarify and add more.

The above covers what the use cases require. Maybe we need to do something
similar for the controls needed in the other direction (frequency is
already covered).

Filtering.
* Process ID
* Address range (PA / VA)
* Access type (read vs write) may matter for migration cost.

Also frequency is more nuanced perhaps:
- How often to give data (timeliness)
- How much data to give (bandwidth)
- When don't I care (threshold)
- How precise do I want it to be (subsampling etc)
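
As a strawman, the controls above could be collected into a single
request a consumer hands to whatever provides the data.  This is only a
sketch: hotness_request, access_event and hotness_event_wanted() are
hypothetical names, not a proposed ABI.  The check below applies only
the filters and the threshold; the timeliness, bandwidth and
subsampling fields are carried along for the provider to act on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum access_type {
	ACCESS_READ  = 1 << 0,
	ACCESS_WRITE = 1 << 1,
};

/* Hypothetical consumer request: filtering plus the four frequency knobs. */
struct hotness_request {
	/* Filtering */
	int32_t  pid;          /* -1 means any process */
	uint64_t addr_start;   /* PA or VA range start, inclusive */
	uint64_t addr_end;     /* range end, exclusive */
	unsigned access_mask;  /* ACCESS_READ and/or ACCESS_WRITE */

	/* Frequency */
	uint32_t report_interval_ms; /* how often to give data (timeliness) */
	uint32_t max_records;        /* how much data to give (bandwidth) */
	uint32_t threshold;          /* below this count, don't care */
	uint32_t subsample;          /* precision: observe 1 in N accesses */
};

/* Hypothetical normalised event from any temperature source. */
struct access_event {
	int32_t  pid;
	uint64_t addr;
	unsigned type;  /* one of enum access_type */
	uint32_t count; /* accesses attributed to this address */
};

/* True if the event passes the request's filters and threshold. */
bool hotness_event_wanted(const struct hotness_request *req,
			  const struct access_event *ev)
{
	assert(req && ev);

	if (req->pid != -1 && req->pid != ev->pid)
		return false;
	if (ev->addr < req->addr_start || ev->addr >= req->addr_end)
		return false;
	if (!(ev->type & req->access_mask))
		return false;
	return ev->count >= req->threshold;
}
```

Not every source can honour every field (PTE A bit scanning has no
read vs write distinction, for example), so a real interface would
presumably also need a way to report which controls are supported.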

The layering is clearly going to be complex, so maybe working through
what info each use case needs would be helpful?

The following is probably too simplistic.

==================================================================
Use case      Nature of data
==================================================================
NUMAB=1       Enough hot pages with remote source.
Balancing
==================================================================
NUMAB=2       Enough hot pages in slow memory
Tiering
Promotion
==================================================================
NUMAB=2       Enough cold pages in fast memory
Tiering
Demotion
==================================================================
LRU list      Specific pages of interest accessed
activation
==================================================================
LRU list      Enough cold pages?
deactivation
==================================================================

Jonathan
> > 
> > Regards,
> > Bharata.