Re: [LSF/MM/BPF TOPIC v2] Unifying sources of page temperature information - what info is actually wanted?

Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> · Fri, 21 Mar 2025 15:30:44 +0000

> > And here is an attempt to compile how different subsystems
> > use the above data:
> > ==========================================================================================
> > Source			Subsystem	Consumption		Activation/Frequency
> > ==========================================================================================
> > PROT_NONE faults	NUMAB		NUMAB=1 locality based	While task is running,
> > via process pgtable			balancing		rate varies on observed
> > walk					NUMAB=2 hot page	locality and sysctl knobs.
> > 					promotion
> > ==========================================================================================
> > folio_mark_accessed()	FS/filemap/GUP	LRU list activation	On cache access and unmap
> > ==========================================================================================
> > PTE A bit via		Reclaim:LRU	LRU list activation,	During memory pressure
> > rmap walk				deactivation/demotion
> > ==========================================================================================
> > PTE A bit via		Reclaim:MGLRU	LRU list activation,	- During memory pressure
> > rmap walk and process			deactivation/demotion	- Continuous sampling (configurable)
> > pgtable walk							  for workingset reporting
> > ==========================================================================================
> > PTE A bit via		DAMON		LRU activation,
> > rmap walk				hot page promotion,
> > 					demotion etc  
> 
> For virtual address space monitoring mode, DAMON uses the PTE A bit via a
> pgtable walk.
> 
> Its activation and frequency are basically set as the user requests.  Activation
> can be set to react to memory-pressure-like events (using watermarks).
> Frequency can be auto-tuned to pursue a target ratio of access events per
> snapshot.

Thanks.  I've added that (in very brief form) to the table in my slides.

> > SJ has proposed perhaps extending DAMON as a possible interface layer. I am
> > yet to understand how that works in cases where regions do not provide
> > a compact representation due to lack of contiguity in the hotness.
> > An example usecase is hypervisor wanting to migrate data under unaware,
> > cheap VMs.  After a system has been running for a while (particularly with hot
> > pages being migrated, swap etc) the hotness map looks much like noise.  
> 
> Similar concerns about DAMON's region abstraction were raised for physical
> address space monitoring, because there is no conscious effort to keep hot
> pages gathered together (that is, locality).
> 
> I'd argue there is no conscious effort to spread temperature out, though.
> As a result, we can expect a level of unintentional bias, and that matches my
> experience from DAMON use cases in product environments so far.

Whilst I'm not in a position to share the data, as it's not mine :( I've
seen graphs that show that for at least some use cases, even if we have some
contiguity of hotness in the VA space, it looks like noise in PA.  So
I think this is a case of 'mileage may vary'. DAMON works great sometimes, but
sometimes the spread of access statistics happens to be wrong.
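
To make the "compact representation" concern concrete, here is a minimal
sketch (not DAMON code; the page counts and the count_regions helper are
made up for illustration) that counts how many exact regions a hotness map
needs when the hot pages are contiguous versus scattered:

```python
import random

def count_regions(hot):
    """Number of maximal runs of equal hotness, i.e. the fewest
    (start, end, is_hot) regions that describe the map exactly."""
    if not hot:
        return 0
    return 1 + sum(1 for a, b in zip(hot, hot[1:]) if a != b)

# Hypothetical 1000-page space with 200 hot pages in one block.
contiguous = [200 <= i < 400 for i in range(1000)]

# Same hot count, but scattered, as if the VA->PA mapping were random.
scattered = contiguous[:]
random.Random(0).shuffle(scattered)

print(count_regions(contiguous))  # 3: cold, hot, cold
print(count_regions(scattered))   # far more regions for the same hot count
```

With a bounded number of regions, the scattered case has to be summarised
lossily, which is where the accuracy question comes from.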

> 
> Also, in practice, DAMON regions are used in combination with other
> information.  For example, DAMON-based reclaim checks the PTE A bit of each page
> in a DAMON-suggested cold memory region to make the final decision about whether to
> reclaim it or not, like MADV_PAGEOUT does.

Makes sense.  The MADV_PAGEOUT case was one of the motivators for the
mixing-methods suggestion.  Here it's kind of DAMON + dense A bit checking (on
candidate pages).
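
A rough sketch of that two-stage pattern, for illustration only.  The region
format and the accessed_recently callback are hypothetical stand-ins for the
coarse monitor's output and a per-page A bit check, not kernel interfaces:

```python
def pages_to_reclaim(cold_regions, accessed_recently):
    """Stage 1: cold_regions, half-open (start_pfn, end_pfn) ranges from a
    coarse monitor.  Stage 2: drop any page whose per-page check says it
    was touched recently."""
    victims = []
    for start, end in cold_regions:
        for pfn in range(start, end):
            if not accessed_recently(pfn):
                victims.append(pfn)
    return victims

recently_touched = {12, 13, 40}
print(pages_to_reclaim([(10, 15), (40, 42)], recently_touched.__contains__))
# [10, 11, 14, 41]
```

The dense check only runs over the candidate ranges, so its cost scales with
the size of the suggested regions rather than with all of memory.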

> 
> That is, yes, I agree DAMON's region abstraction is maybe not a good way to
> find a perfect answer to some questions, such as finding the N-th hottest single page.
> And it has much room to improve.  Nevertheless, even the DAMON of today can give
> good enough best-effort answers for questions that are practical for some cases,
> such as finding regions that may contain the N hottest/coldest pages, while
> keeping the monitoring overhead fixed as users ask.
> 
> Also, please note that there is no reason to restrict DAMON to always use
> the regions abstraction.  For different use cases and situations, DAMON will be open
> to being extended to use new abstractions.  DAMON aims not to be a subsystem for
> the DAMON regions concept but for data access monitoring with practical efficiency,
> and to continue evolving for given environments.

Absolutely understood. In my current thinking DAMON sits at a particular layer
in the stack and there may be one more abstraction on top of it (e.g. a list
of hot/cold pages). Equally possible that the layers may fuse and it becomes
an aspect of DAMON.

> 
> > 
> > Now for the "there be monsters bit"...
> > ---------------------------------------
> > 
> > - Stability of hotness matters and is hard to establish.
> >   Predict a page will remain hot - various heuristics.
> > 	a) It is hot, probably stays so? (super hot!)
> > 	   Sometimes enough to be detected as hot once,
> > 	   often not.
> > 	b) It has been hot a while, probably stays so.
> > 	   Check this hot list against previous hot list,
> > 	   entries in both needed to promote.
> > 	   This has a problem if hotlist is small compared to
> > 	   total count of hot pages.  Say list is 1%, 20% actually
> > 	   hot, low chance of repeats even in hot pages.
> > 	c) It is hot, let's monitor a while before doing anything.
> > 	   Measurement technique may change. Maybe cheaper
> > 	   to monitor 'candidate' pages than all pages
> > 	   e.g. CXL HMU gives 1000 pages, then we use access bit
> > 	        sampling to check they are at least accessed N times
> > 		in next second.
> > 	d) It was hot, We moved it. Did it stay hot?
> > 	   More useful to identify when we are thrashing and should
> > 	   just stop doing anything.  Too late to fix this one!
> 
> DAMON is providing a sort of b) approach, aka DAMON regions' age, for finding
> both hot and cold regions.
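
The arithmetic behind the problem noted in b) can be sketched as follows,
under the simplifying assumption (not stated in the thread) that each hot
list is an independent uniform sample of the truly hot pages.  The function
names are illustrative only:

```python
def repeat_odds(list_fraction, hot_fraction):
    """Chance that a given hot page is on one list, and on two
    consecutive lists, if each list samples hot pages uniformly."""
    p_listed = list_fraction / hot_fraction
    return p_listed, p_listed ** 2

def stable_hot(previous, current):
    """Heuristic b): promote only pages present in both hot lists."""
    return set(previous) & set(current)

# List covers 1% of pages, 20% of pages are actually hot.
p_listed, p_both = repeat_odds(0.01, 0.20)
print(f"{p_listed:.2%} {p_both:.2%}")  # 5.00% 0.25%
```

So under that assumption only about one in twenty entries of a list would
be expected to repeat, even though every entry is genuinely hot.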
> 
> > - Some data should be considered hot even when not in use (e.g. stack)  
> 
> DAMOS filters are for this kind of exception, and the DAMON kernel API is flexible
> enough to let callers directly manipulate the region information based on
> their special knowledge.  We can further optimize the interface for easier
> use, of course.

Nice.

> 
> > - Usecases interfere. So it can't just be a broadcast mode
> >   where hotness information is sent to all users.
> > - When to stop, start migration / tracking?
> > 	a) Detecting bad decisions. Enough bad decisions, better to
> > 	   do nothing?
> >  	b) Metadata beyond the counts is useful
> > 	   https://lore.kernel.org/all/87h64u2xkh.fsf@DESKTOP-5N7EMDA/
> > 	   Promotion algorithms can need aggregate statistics for a memory 
> > 	   device to decide how much to move.  
> 
> The DAMOS quota goals feature is aimed at this question.  It allows
> users to set a target metric and value, and tunes the aggressiveness to reach it.  For
> promotions and demotions, I suggested using upper-tier utilization and free
> ratio as possible goal metrics, and I'm going to post an implementation for that
> soon.

Those are certainly good metrics to consider, but I think we definitely also
need a metric around how beneficial the moves being made are.

That matters more on the promotion path, because that interrupts access to
hot data and so will cause a temporary drop in performance / latency spike.
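
One possible shape for such a benefit metric, purely as a sketch: track what
fraction of recently promoted pages were still hot after a settling window,
and scale the promotion quota by it.  The names, the 0.8 target and the
proportional rule are all assumptions for illustration, not DAMOS behaviour:

```python
def next_quota(quota, promoted, still_hot, target=0.8, floor=1):
    """Scale a per-interval promotion quota by how many earlier
    promotions paid off (stayed hot), relative to a target ratio."""
    if promoted == 0:
        return quota  # no evidence either way, leave quota unchanged
    benefit = still_hot / promoted
    return max(floor, round(quota * benefit / target))

# 500 pages promoted last interval, only 200 stayed hot: back off.
print(next_quota(1000, promoted=500, still_hot=200))  # 500
```

A benefit ratio near zero drives the quota down to the floor, which is
roughly the "enough bad decisions, better to do nothing" case.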

> 
> > 
> > As noted above, this may well overlap with other sessions.
> > One outcome of the discussion so far is to highlight what I think many
> > already knew.  This is hard!  
> 
> Indeed.  Keeping more people on the same page is important and difficult.
> Thank you for your effort again, and looking forward to discussing in more depth!
>

I'm not sure we'll succeed.  This may well be a wild west situation for a while
yet, but hopefully we can slowly converge or at least build some common
parts.

Jonathan

p.s. Heathrow disruption means I'm crossing my fingers on actually getting to
Montreal.

> 
> Thanks,
> SJ
> 
> > 
> > Jonathan