On Thu, 6 Mar 2025 11:15:30 +0530
Bharata B Rao <bharata@xxxxxxx> wrote:

> kpromoted is a kernel daemon that accumulates hot page info
> from different sources and tries to promote pages from slow
> tiers to top tiers. One instance of this thread runs on each
> node that has CPUs.
>

Firstly, nice work. Much easier to discuss things with an
implementation to look at.

I'm looking at this with my hardware hotness unit "hammer" in hand :)

> Subsystems that generate hot page access info can report that
> to kpromoted via this API:
>
> int kpromoted_record_access(u64 pfn, int nid, int src,
>			      unsigned long time)

This perhaps works as an interface for aggregating methods that
produce per access events. Any hardware counter solution is going
to give you data that is closer to what you used for the promotion
decision.

We might need to aggregate at different levels. So access counting
promotes to a hot list and we can inject other events at that level.
The data I have from the CXL HMU is typically after an epoch (period
of time) these N pages were accessed more than M times. I can sort
of map that to the internal storage you have.

Would be good to evaluate approximate trackers on top of access
counts. I've no idea if sketches or similar would be efficient
enough (they have a bit of a write amplification problem) but they
may give good answers with much lower storage cost at the risk of
occasionally saying something is hot when it's not.

>
> @pfn: The PFN of the memory accessed
> @nid: The accessing NUMA node ID
> @src: The temperature source (subsystem) that generated the
>       access info
> @time: The access time in jiffies
>
> Some temperature sources may not provide the nid from which
> the page was accessed. This is true for sources that use
> page table scanning for PTE Accessed bit. Currently the toptier
> node to which such pages should be promoted to is hard coded.

For those cases (CXL HMU included) maybe we need to consider how to
fill in the missing node info with at least a vague chance of getting
a reasonable target for migration. We can always fall back to a random
top tier node, or to the one nearest to the node the page currently
sits on (on the basis that we may have landed on this node via a
fallback list when the top tier was under memory pressure).
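
To make that fallback order concrete, here is a minimal userspace
sketch, not kernel code and not taken from the posted patch. The
helper name pick_promotion_target(), the NID_UNKNOWN marker, the
flattened distance matrix and the is_toptier[] array are all
assumptions made up for illustration; a real implementation would
presumably use the kernel's own node distance and memory tier
helpers instead.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical marker: the temperature source gave no accessing node. */
#define NID_UNKNOWN (-1)

/*
 * Hypothetical helper, for illustration only.
 *
 * accessing_nid: node that touched the page, or NID_UNKNOWN
 * src_nid:       node the page currently sits on (assumed valid)
 * dist:          flattened nr_nodes x nr_nodes distance matrix,
 *                SLIT-style, smaller means nearer
 * is_toptier:    true for nodes that belong to the top tier
 */
static int pick_promotion_target(int accessing_nid, int src_nid,
                                 const int *dist, const bool *is_toptier,
                                 int nr_nodes)
{
    int best = NID_UNKNOWN;
    int best_dist = INT_MAX;

    /* The source reported a usable top tier node: promote there. */
    if (accessing_nid >= 0 && accessing_nid < nr_nodes &&
        is_toptier[accessing_nid])
        return accessing_nid;

    /* Otherwise pick the top tier node nearest to the page's node. */
    for (int nid = 0; nid < nr_nodes; nid++) {
        int d;

        if (!is_toptier[nid])
            continue;
        d = dist[(size_t)src_nid * nr_nodes + nid];
        if (d < best_dist) {
            best_dist = d;
            best = nid;
        }
    }

    /* Still NID_UNKNOWN if no top tier node exists at all. */
    return best;
}
```

The nearest-node choice is only a heuristic: it assumes the page
ended up on its current node through a fallback list, which is not
guaranteed. A random top tier node would be the alternative when no
distance information is worth trusting.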