Re: Slow-tier Page Promotion discussion recap and open questions

David Rientjes <rientjes@xxxxxxxxxx> · Wed, 1 Jan 2025 20:44:39 -0800 (PST)

On Fri, 20 Dec 2024, Raghavendra K T wrote:

> > I asked if this was really done single threaded, which was confirmed.  If
> > only a single process has pages on a slow memory tier, for example, then
> > flexible tuning of the scan period and size ensures we do not scan
> > needlessly.  The scan period can be tuned to be more responsive (down to
> > 400ms in this proposal) depending on how many accesses we have on the
> > last scan; similarly, it can be much less responsive (up to 5s) if memory
> > is not found to be accessed.
> > 
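
To make the tuning described above concrete, here is a minimal sketch in C.
It assumes a simple halve/double policy, which is only an illustration and
not necessarily what the RFC implements; the 400ms and 5s bounds are the
only values taken from the discussion.  The function name and the
accessed_pages parameter are hypothetical.

```c
#include <stdint.h>

#define SCAN_PERIOD_MIN_MS 400u   /* most responsive period from the recap */
#define SCAN_PERIOD_MAX_MS 5000u  /* least responsive period from the recap */

/*
 * Hypothetical policy, for illustration only: halve the period when the
 * last scan found accessed pages, double it when it found none, and clamp
 * the result to [400ms, 5s].
 */
uint32_t next_scan_period_ms(uint32_t cur_ms, unsigned long accessed_pages)
{
	uint32_t next;

	/* Clamp the input first so the doubling below cannot overflow. */
	if (cur_ms < SCAN_PERIOD_MIN_MS)
		cur_ms = SCAN_PERIOD_MIN_MS;
	if (cur_ms > SCAN_PERIOD_MAX_MS)
		cur_ms = SCAN_PERIOD_MAX_MS;

	next = accessed_pages ? cur_ms / 2 : cur_ms * 2;

	/* Keep the new period inside the bounds quoted in the recap. */
	if (next < SCAN_PERIOD_MIN_MS)
		next = SCAN_PERIOD_MIN_MS;
	if (next > SCAN_PERIOD_MAX_MS)
		next = SCAN_PERIOD_MAX_MS;

	return next;
}
```

Under this assumed policy a busy address space converges on the 400ms floor
after a few scans, while an idle one backs off to the 5s ceiling.
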
> > I also asked if scanning can be disabled entirely; Raghu clarified that
> > it cannot be.
> > 
> 
> We have a sysfs tunable (kmmscand/scan_enabled) to enable/disable the
> whole scanning at a global level but not at process level granularity.
> 

Thanks Raghu for the clarification.  I think there was a preference during
discussion to make this multi-threaded so we didn't rely on a single
kmmscand thread; perhaps this would be (at minimum) one kmmscand thread
per NUMA node?

> > Wei Xu asked if the scan period should be interpreted as the minimal
> > interval between scans because kmmscand is single threaded and there are
> > many processes.  Raghu confirmed this is correct: it is the minimal delay.
> > Even if the scan period is 400ms, in reality it could be multiple seconds
> > based on load.
> > 
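
As a back-of-the-envelope model of the point above: with one thread walking
every process in turn, the gap between two scans of the same process is the
larger of the configured period and the time a full pass takes.  A minimal
sketch in C follows; the function name and the per-process cost array are
hypothetical, and the costs are made-up inputs rather than measurements.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of a single scanning thread: one pass visits every
 * process once, and the next pass starts no earlier than period_ms after
 * the previous one started.
 */
uint64_t effective_scan_interval_ms(uint64_t period_ms,
				    const uint64_t *scan_cost_ms,
				    size_t nr_procs)
{
	uint64_t pass_ms = 0;
	size_t i;

	/* Time for one full pass is the sum of the per-process scan costs. */
	for (i = 0; i < nr_procs; i++)
		pass_ms += scan_cost_ms[i];

	/* A short pass waits out the period; a long pass overruns it. */
	return pass_ms > period_ms ? pass_ms : period_ms;
}
```

With a 400ms period, two processes costing 100ms and 50ms to scan keep the
interval at 400ms, while three processes costing 1500ms, 900ms and 700ms
stretch it to 3100ms.
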
> > Liam Howlett asked how we could have two scans colliding in a time
> > segment.  Raghu noted that if we are able to complete the last scan in
> > less than 400ms, then we have this delay to avoid continuous scanning,
> > which would increase CPU overhead.  Liam further asked if processes opt
> > into or out of the scan; Raghu noted we always scan every process.  John
> > Hubbard suggested that we have per-process control.
> 
> +1 for prctl()
> 
> Also, I want to add that I will get data on:
> 
> the min and max time required to finish the entire scan for the
> current micro-benchmark and one of the real workloads (such as Redis/
> RocksDB...), so that we can check if we are meeting the scanning
> deadline with a single kthread.
> 

Do we want more fine-grained per-process control than just the
ability to opt out entire processes?  There may be situations where we
want to always serve latency tolerant jobs from CXL extended memory and
don't care to ever promote their memory, but I also think there will be
processes that are between the two extremes (latency critical and latency
tolerant).

I think careful consideration needs to be given to how we handle 
per-process policy for multi-tenant systems that have different levels of 
latency sensitivity.  If kmmscand becomes the standard way of doing page 
promotion in the kernel, the userspace API to inform it of these policy 
decisions is going to be key.  There have been approaches where this was
primarily driven by BPF, and they have to solve the same challenge.

> > Wei noted an important point about separating hot page detection and
> > promotion, which don't actually need to be coupled at all.  This uses
> > page table scanning while future support may not need to leverage this at
> > all.  We'd very much like to avoid multiple promotion solutions for
> > different ways to track page hotness.
> > 
> > I strongly supported this because I believe for CXL, at least within the
> > next three years, that memory hotness will likely not be derived from
> > page table Accessed bit scanning.  Zi Yan agreed.
> > 
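
One way to picture the decoupling: detection backends only report hot pages
into a common queue, and a single promotion path consumes that queue no
matter where the hotness information came from.  A minimal sketch in C
follows; every name in it (promote_queue, hotness_source, the two stand-in
backends) is hypothetical and the PFNs are dummy values, so this is an
illustration of the idea rather than a proposed interface.

```c
#include <stddef.h>
#include <stdint.h>

#define PROMOTE_QUEUE_LEN 64

/* Hypothetical common sink shared by every hotness source. */
struct promote_queue {
	uint64_t pfn[PROMOTE_QUEUE_LEN];
	size_t count;
};

/* Returns 0 on success, -1 if the queue is full. */
int promote_queue_report(struct promote_queue *q, uint64_t pfn)
{
	if (q->count >= PROMOTE_QUEUE_LEN)
		return -1;
	q->pfn[q->count++] = pfn;
	return 0;
}

/* Hypothetical detection backend: it only knows how to find hot pages. */
struct hotness_source {
	const char *name;
	/* Reports hot pages into q and returns how many were queued. */
	size_t (*collect)(struct promote_queue *q);
};

/* Stand-in for page table Accessed bit scanning: reports two dummy PFNs. */
size_t pte_scan_collect(struct promote_queue *q)
{
	size_t n = 0;

	n += promote_queue_report(q, 0x1000) == 0;
	n += promote_queue_report(q, 0x1001) == 0;
	return n;
}

/* Stand-in for a hardware hotness monitor: reports one dummy PFN. */
size_t hw_monitor_collect(struct promote_queue *q)
{
	return promote_queue_report(q, 0x2000) == 0;
}

/* The promotion side sees one queue, independent of the backends. */
size_t collect_hot_pages(const struct hotness_source *srcs, size_t nr,
			 struct promote_queue *q)
{
	size_t total = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		total += srcs[i].collect(q);
	return total;
}
```

Swapping the page table scanner for a different detection mechanism would
then mean adding another hotness_source entry, with the promotion side left
unchanged.
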
> > The promotion path may also want to be much less aggressive than on first
> > access.  Raghu showed many improvements, including handling short lived
> > processes, more accurate hot page detection using timestamp, etc.
> 
> Some of these TODOs can be implemented in the next version.
> 

Thanks!  Are you planning on sending out another RFC patch series soon, or
are you interested in publishing this on git.kernel.org or GitHub?  There
may be an opportunity for others to send you pull requests against the
patch series while we discuss.

> > ----->o-----
> > I followed up on a discussion point early in the talk about whether this
> > should be virtual address scanning like the current approach, walking
> > mm_struct's, or the alternative approach which would be physical address
> > scanning.
> > 
> > Raghu sees this as a fully alternative approach, such as what DAMON
> > uses, that is based on rmap.  The only advantage appears to be avoiding
> > scanning on top tier memory completely.
> 
> Having clarity here would help. Both approaches have their own pros
> and cons.
> 
> We also need to explore using/reusing DAMON/MGLRU to the extent possible,
> based on the approach.
> 

Yeah, I definitely think this is a key point to discuss early on.  Gregory 
had indicated that unmapped file cache is one of the key downsides to 
using only virtual memory scanning.

While things like the CHMU are still on the way, I think there's benefit 
to making incremental progress from what we currently have available (NUMA 
Balancing) before we get there.