On Wed, 5 Feb 2025 11:05:29 -0500
Johannes Weiner <hannes@xxxxxxxxxxx> wrote:

> On Wed, Feb 05, 2025 at 11:54:05AM +0530, Bharata B Rao wrote:
> > On 31-Jan-25 6:39 PM, Jonathan Cameron wrote:
> > > On Fri, 31 Jan 2025 12:28:03 +0000
> > > Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> wrote:
> > >
> > >>> Here is the list of potential discussion points:
> > >> ...
> > >>
> > >>> 2. Possibility of maintaining single source of truth for page hotness that would
> > >>> maintain hot page information from multiple sources and let other sub-systems
> > >>> use that info.
> > >> Hi,
> > >>
> > >> I was thinking of proposing a separate topic on a single source of hotness,
> > >> but this question covers it so I'll add some thoughts here instead.
> > >> I think we are very early, but sharing some experience and thoughts in a
> > >> session may be useful.
> > >
> > > Thinking more on this over lunch, I think it is worth calling this out as a
> > > potential session topic in its own right rather than trying to find
> > > time within other sessions. Hence the title change.
> > >
> > > I think a session would start with a brief listing of the temperature sources
> > > we have and those on the horizon to motivate what we are unifying, then
> > > discussion to focus on need for such a unification + requirements
> > > (maybe with a straw man).
> >
> > Here is a compilation of available temperature sources and how the
> > hot/access data is consumed by different subsystems:
>
> This is super useful, thanks for collecting this.

Absolutely agree!

>
> > PA-Physical address available
> > VA-Virtual address available
> > AA-Access time available
> > NA-accessing Node info available
> >
> > I have left the slot blank for those which I am not sure about.
> > ==================================================
> > Temperature             PA      VA      AA      NA
> > source
> > ==================================================
> > PROT_NONE faults        Y       Y       Y       Y
> > --------------------------------------------------
> > folio_mark_accessed()   Y               Y       Y
> > --------------------------------------------------
>
> For fma(), the VA info is available in unmap, but usually it isn't -
> or doesn't meaningfully exist, as in the case of unmapped buffered IO.
>
> I'd say it's an N.
>
> > PTE A bit               Y       Y       N       N
> > --------------------------------------------------
> > Platform hints          Y       Y       Y       Y
> > (AMD IBS)
> > --------------------------------------------------
> > Device hints            Y
> > (CXL HMU)
> > ==================================================

For the use cases where we have relatively few 'pages', the cost of a
reverse map lookup doesn't look to be a problem. The trick is to do it only
after we've done what we can in PA space to cut down on the pages of
interest. So maybe (Y) to reflect that it is indirect. Whether it makes
sense to do that before or after some common layer is an interesting
question. That PA/VA mapping might be out of date anyway by the time we
see the data.

>
> For the following table, it might be useful to add *when* the source
> produces this information. Sampling frequency is a likely challenge:
> consumers have different requirements, and overhead should be limited
> to the minimum required to serve enabled consumers.
>
> Here is an (incomplete) attempt - sorry about the long lines:
>
> > And here is an attempt to compile how different subsystems
> > use the above data:
> > ==============================================================
> > Source                  Subsystem       Consumption               Activation/Frequency
> > ==============================================================
> > PROT_NONE faults        NUMAB           NUMAB=1 locality based    While task is running,
> > via process pgtable                     balancing                 rate varies on observed
> > walk                                    NUMAB=2 hot page          locality and sysctl knobs.
> >                                         promotion
> > ==============================================================
> > folio_mark_accessed()   FS/filemap/GUP  LRU list activation       On cache access and unmap
> > ==============================================================
> > PTE A bit via           Reclaim:LRU     LRU list activation,      During memory pressure
> > rmap walk                               deactivation/demotion
> > ==============================================================
> > PTE A bit via           Reclaim:MGLRU   LRU list activation,      - During memory pressure
> > rmap walk and process                   deactivation/demotion     - Continuous sampling (configurable)
> > pgtable walk                                                        for workingset reporting
> > ==============================================================
> > PTE A bit via           DAMON           LRU activation,           Continuous sampling (configurable)?
> > rmap walk                               hot page promotion,       (I believe SJ is looking into
> >                                         demotion etc              auto-tuning this).
> > ==============================================================
> > Platform hints          NUMAB           NUMAB=1 Locality based
> > (AMD IBS)                               balancing and
> >                                         NUMAB=2 hot page
> >                                         promotion
> > ==============================================================

Based on the CXL one...

> > Device hints            NUMAB           NUMAB=2 hot page          Continuous sampling, frequency controllable.
> >                                         promotion                 Subsampling programmable.
> > ==============================================================
> > The last two are listed as possibilities.
> >
> > Feel free to correct/clarify and add more.

The above covers what the use cases require. Maybe we need to do
similar for the controls needed the other way (frequency already covered).

Filtering.
* Process ID
* Address range (PA / VA)
* Access type (read vs write), which may matter for migration cost.

Also frequency is more nuanced perhaps:
- How often to give data (timeliness)
- How much data to give (bandwidth)
- When don't I care (threshold)
- How precise do I want it to be (subsampling etc)

The layering is clearly going to be complex, so maybe working out, for each
use case, what info it needs would be helpful?
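
To make the filtering and frequency controls listed above a bit more
concrete, here is a rough sketch in plain userspace C. It is purely
illustrative: none of these names (hot_filter, hot_rate, hot_sample,
hot_sample_wanted) exist in the kernel, and the field choices are
assumptions that simply mirror the bullets above.

```c
/* Hypothetical sketch only: none of these names exist in the kernel. */
#include <stdbool.h>
#include <stdint.h>

enum hot_access_type {
	HOT_ACCESS_READ  = 1 << 0,
	HOT_ACCESS_WRITE = 1 << 1,
};

/* Filtering: which accesses a consumer wants to hear about. */
struct hot_filter {
	int32_t  pid;         /* -1 means any process */
	uint64_t addr_start;  /* inclusive; PA or VA depending on the source */
	uint64_t addr_end;    /* exclusive; 0 means no address range limit */
	unsigned access_mask; /* HOT_ACCESS_* bits of interest */
};

/* Frequency: one field per aspect in the list above. */
struct hot_rate {
	uint32_t report_interval_ms; /* timeliness: how often to give data */
	uint32_t max_records;        /* bandwidth: cap on records per report */
	uint32_t min_count;          /* threshold: below this, don't care */
	uint32_t subsample_shift;    /* precision: sample 1 in 2^shift accesses */
};

/* One observation handed up by some temperature source. */
struct hot_sample {
	int32_t  pid;
	uint64_t addr;
	unsigned access; /* a single HOT_ACCESS_* bit */
	uint32_t count;  /* accesses observed for this page so far */
};

/*
 * Return true if the sample passes the consumer's filter and threshold.
 * The interval, record cap and subsampling fields are not checked here;
 * they would be applied by whatever produces and batches the samples.
 */
static bool hot_sample_wanted(const struct hot_filter *f,
			      const struct hot_rate *r,
			      const struct hot_sample *s)
{
	if (f->pid != -1 && f->pid != s->pid)
		return false;
	if (f->addr_end && (s->addr < f->addr_start || s->addr >= f->addr_end))
		return false;
	if (!(f->access_mask & s->access))
		return false;
	return s->count >= r->min_count;
}
```
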
The following is probably too simplistic.

==================================================================
Usecase         Nature of data
==================================================================
NUMAB=1         Enough hot pages with remote source.
Balancing
==================================================================
NUMAB=2         Enough hot pages in slow memory
Tiering
Promotion
==================================================================
NUMAB=2         Enough cold pages in fast memory
Tiering
Demotion
==================================================================
LRU list        Specific pages of interest accessed
activation
==================================================================
LRU list        Enough cold pages?
deactivation
==================================================================

Jonathan

> >
> > Regards,
> > Bharata.