Prior to LSFMM, this is an update on where the discussion has gone on list
since the original proposal back in January (which was buried in the thread
for Ragha's proposal focused on PTE A bit scanning).

v1: https://lore.kernel.org/all/20250131130901.00000dd1@xxxxxxxxxx/

Note that this combines comments and discussion from many people and I may
well have summarized things badly and missed key details. If time allows
I'll update with a v3 when people have ripped up this straw man.

Bharata has posted code for one approach and discussion is ongoing:
https://lore.kernel.org/linux-mm/20250306054532.221138-1-bharata@xxxxxxx/

This proposal overlaps with parts of several other proposals (DAMON, access
bit tracking etc.) but the focus is intended to be more general.

Abstract:

We have:
1) A range of different technologies tracking what may be loosely defined
   as the hotness of regions of memory.
2) A set of use cases that care about this data.

Question: Is it useful or feasible to aggregate the data from the sources
(1) in some layer before providing answers to (2)? What should that layer
look like? What services and abstractions should it provide? Is there
commonality in what those use cases need?

By aggregate I'm not necessarily implying multiple techniques in use at
once, but rather that we want one interface driven by whatever solution is
the right balance on a particular system. That balance can be affected by
hardware availability or by characteristics of the system or workload.

Note that many of the hotness driven actions are painful (e.g. migration
of hot pages) and for those we need to be very sure it is a good idea to
do anything at all! My assumption is that in at least some cases the
problem will be too hard to solve in kernel, but let's consider what we
can do.

On to the details:
------------------

Note: I'm ignoring the low level implementation details of each method and
how they avoid resource exhaustion, tune sampling timing (epoch length)
and what is sampled (scanning, random, etc.) as in at least some cases
that's a problem for the lowest, technique specific level.

Enumerating the cases (thanks to Bharata, Johannes, SJ and others for
inputs on this!). Much of this is direct quotes from this thread:
https://lore.kernel.org/all/de31971e-98fc-4baf-8f4f-09d153902e2e@xxxxxxx/
(particularly Bharata's reply to my original questions)

Here is a compilation of available temperature sources and how the
hot/access data is consumed by different subsystems:

PA - Physical address available
VA - Virtual address available
AA - Access time available
NA - accessing Node info available

==============================================
Temperature source      PA    VA    AA    NA
==============================================
PROT_NONE faults        Y     Y     Y     Y
----------------------------------------------
folio_mark_accessed()   Y     Y     Y
----------------------------------------------
PTE A bit               Y     Y     N*    N
----------------------------------------------
Platform hints          Y     Y     Y     Y
(AMD IBS)
----------------------------------------------
Device hints            Y     N     N     N
(CXL HMU)
==============================================
* Some information available from scanning timing.

In all cases other methods can be applied to fill in the missing data
(rmap etc.).
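To make that concrete, the table maps naturally onto a normalized event
record with per-field validity flags, so each source reports what it has
and the aggregation layer knows what is missing. Below is a minimal
userspace sketch; this is not an existing kernel interface and all the
names (hot_page_event, HOT_EVT_*, hmu_report()) are invented purely for
illustration.

/*
 * Sketch only: a normalized hot-page event covering the PA/VA/AA/NA
 * columns of the table above, with flags saying which fields the
 * reporting technique could actually fill in.
 */
#include <stdint.h>

#define HOT_EVT_PA_VALID        (1u << 0)       /* physical address */
#define HOT_EVT_VA_VALID        (1u << 1)       /* virtual address */
#define HOT_EVT_TIME_VALID      (1u << 2)       /* access time */
#define HOT_EVT_NODE_VALID      (1u << 3)       /* accessing node */

struct hot_page_event {
        uint32_t flags;         /* HOT_EVT_* validity bits */
        uint64_t paddr;         /* PA, if flagged valid */
        uint64_t vaddr;         /* VA, if flagged valid */
        uint64_t time_ns;       /* AA, if flagged valid */
        int32_t  nid;           /* NA, if flagged valid */
        int32_t  source;        /* which technique reported this */
};

/*
 * A CXL HMU style source can only fill in the physical address; per
 * the note above, a layer up may fill in the rest (e.g. via rmap)
 * when a consumer needs it.
 */
static struct hot_page_event hmu_report(uint64_t paddr)
{
        struct hot_page_event ev = {
                .flags = HOT_EVT_PA_VALID,
                .paddr = paddr,
        };
        return ev;
}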
And here is an attempt to compile how different subsystems use the above
data:

==========================================================================
Source                Subsystem       Consumption           Activation/
                                                            Frequency
==========================================================================
PROT_NONE faults      NUMAB           NUMAB=1 locality      While task is
                                      based balancing,      running, via
                                      NUMAB=2 hot page      process pgtable
                                      promotion             walk; rate
                                                            varies on
                                                            observed
                                                            locality and
                                                            sysctl knobs
==========================================================================
folio_mark_accessed() FS/filemap/GUP  LRU list activation   On cache access
                                                            and unmap
==========================================================================
PTE A bit via         Reclaim:LRU     LRU list activation,  During memory
rmap walk                             deactivation/         pressure
                                      demotion
==========================================================================
PTE A bit via         Reclaim:MGLRU   LRU list activation,  - During memory
rmap walk and                         deactivation/           pressure
process pgtable                       demotion              - Continuous
walk                                                          sampling
                                                              (configurable)
                                                              for workingset
                                                              reporting
==========================================================================
PTE A bit via         DAMON           LRU activation,
rmap walk                             hot page promotion,
                                      demotion etc.
==========================================================================
Platform hints        NUMAB           NUMAB=1 locality
(e.g. AMD IBS)                        based balancing and
                                      NUMAB=2 hot page
                                      promotion
==========================================================================
Device hints          NUMAB           NUMAB=2 hot page
(e.g. CXL HMU)                        promotion
==========================================================================
PG_young / PG_idle    ?
==========================================================================

Technique trade offs: why not just use one method?

- Cost of capture, cost of use.
  * Run all the time - aggregate data for stability of hotness.
  * Run occasionally to minimize cost.
- Different availability, e.g. IBS might be needed for other things,
  hardware monitors may not be available.

Straw man (based in part on the IBS proposal linked above)
----------------------------------------------------------

Multiple sources become similar at different levels. Taking just tiering
promotion as an example, and keeping in mind the golden rule of tiered
memory: put data in the right place to start with if you can. So this is
about when you can't: application unaware, changing memory pressure and
workload mix etc.

 _____________________       __________________
| Sampling techniques |     | Hardware units   |
| - Access counter,   |     |  CXL HMU etc     |
| - Trace based       |     |__________________|
|_____________________|              |
        |                            |
        | Hot page Events            |
 _______v_____________               |
| Events to counts    |              |
| - hashtable, sketch |              |
|   etc               |              |
|_____________________|              |
        |                            |
        | Hot page                   |
 _______v____________________________v_______
| Hot list - responsible for stability?      |
|____________________________________________|
        |
        | Timely hotlist data
        |                  | Additional data (process newness,
        |                  |   stack location...?)
 _______v__________________v______
| Promotion Daemon                |
|_________________________________|

For all paths where data is flowing down we probably need control
parameters flowing back the other way, and if we have multiple users of
the datastream we need to satisfy each of their constraints.
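One concrete reading of the "Events to counts" box is a count-min sketch
keyed by PFN, which keeps memory bounded no matter how many distinct
pages get reported. Below is a minimal userspace sketch; the sizes, the
hash (a SplitMix64-style mixer) and the decay policy are illustrative
assumptions, not a proposed implementation, and the function names
(cms_add() etc.) are made up.

/* Sketch only: count-min sketch aggregating access events per PFN. */
#include <stdint.h>

#define CMS_ROWS        4
#define CMS_COLS        (1u << 14)      /* 16384 counters per row */

static uint32_t cms[CMS_ROWS][CMS_COLS];

/* Cheap 64-bit mix (SplitMix64 finalizer), keyed per row. */
static uint32_t cms_hash(uint64_t pfn, unsigned int row)
{
        uint64_t x = pfn + 0x9e3779b97f4a7c15ULL * (row + 1);

        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return (uint32_t)(x ^ (x >> 31)) & (CMS_COLS - 1);
}

/* Record one access event for pfn. */
static void cms_add(uint64_t pfn)
{
        for (unsigned int r = 0; r < CMS_ROWS; r++)
                cms[r][cms_hash(pfn, r)]++;
}

/* Estimated access count: may overestimate, never underestimates. */
static uint32_t cms_estimate(uint64_t pfn)
{
        uint32_t min = UINT32_MAX;

        for (unsigned int r = 0; r < CMS_ROWS; r++) {
                uint32_t c = cms[r][cms_hash(pfn, r)];

                if (c < min)
                        min = c;
        }
        return min;
}

/* Halve all counters at the end of an epoch so stale heat decays. */
static void cms_decay(void)
{
        for (unsigned int r = 0; r < CMS_ROWS; r++)
                for (unsigned int c = 0; c < CMS_COLS; c++)
                        cms[r][c] >>= 1;
}

The appeal is the one-sided error: collisions can make a page look hotter
than it is but never colder, so the failure mode is monitoring a cold page
rather than missing a hot one. The per-epoch decay is one cheap way to make
old heat fade, which feeds into the stability questions below.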
SJ has proposed perhaps extending DAMON as a possible interface layer. I
have yet to understand how that works in cases where regions do not
provide a compact representation, due to lack of contiguity in the
hotness. An example use case is a hypervisor wanting to migrate data
underneath unaware, cheap VMs. After a system has been running for a
while (particularly with hot pages being migrated, swap etc.) the hotness
map looks much like noise.

Now for the "there be monsters" bit...
--------------------------------------

- Stability of hotness matters and is hard to establish. Predicting that
  a page will remain hot involves various heuristics:
  a) It is hot, probably stays so? (super hot!) Sometimes enough to be
     detected as hot once, often not.
  b) It has been hot a while, probably stays so. Check this hot list
     against the previous hot list; entries in both are needed to promote
     (see the sketch after this list). This has a problem if the hot list
     is small compared to the total count of hot pages. Say the list is
     1% and 20% is actually hot: low chance of repeats even among hot
     pages.
  c) It is hot, let's monitor a while before doing anything. The
     measurement technique may change here. Maybe it is cheaper to
     monitor 'candidate' pages than all pages, e.g. CXL HMU gives 1000
     pages, then we use access bit sampling to check they are accessed at
     least N times in the next second.
  d) It was hot, we moved it. Did it stay hot? More useful to identify
     when we are thrashing and should just stop doing anything. Too late
     to fix this one!
- Some data should be considered hot even when not in use (e.g. stack).
- Use cases interfere. So it can't just be a broadcast mode where hotness
  information is sent to all users.
- When to stop / start migration or tracking?
  a) Detecting bad decisions. Enough bad decisions, better to do nothing?
  b) Metadata beyond the counts is useful:
     https://lore.kernel.org/all/87h64u2xkh.fsf@DESKTOP-5N7EMDA/
     Promotion algorithms can need aggregate statistics for a memory
     device to decide how much to move.
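For heuristic (b), the check itself is just an intersection of consecutive
hot lists. A minimal sketch, assuming (purely for illustration) that hot
lists are sorted arrays of PFNs and with a made-up name, hotlist_stable():

/* Sketch only: keep pages that were hot in two consecutive epochs. */
#include <stdint.h>
#include <stddef.h>

/*
 * Emit PFNs present in both sorted lists into out (sized at least the
 * smaller of the two inputs); returns how many survived, i.e. the
 * candidates considered stable enough to promote.
 */
static size_t hotlist_stable(const uint64_t *prev, size_t nprev,
                             const uint64_t *cur, size_t ncur,
                             uint64_t *out)
{
        size_t i = 0, j = 0, n = 0;

        while (i < nprev && j < ncur) {
                if (prev[i] < cur[j]) {
                        i++;
                } else if (prev[i] > cur[j]) {
                        j++;
                } else {
                        out[n++] = cur[j];      /* hot in both epochs */
                        i++;
                        j++;
                }
        }
        return n;
}

To put numbers on the sampling problem in (b): if the list covers 1% of
pages, 20% of pages are actually hot, and the list sampled hot pages
uniformly, a given hot page has only a 0.01/0.20 = 5% chance of appearing
in each list, so only ~5% of entries would repeat even with perfectly
stable hotness.

As noted above, this may well overlap with other sessions. One outcome of
the discussion so far is to highlight what I think many already knew.
This is hard!

Jonathan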