> > And here is an attempt to compile how different subsystems
> > use the above data:
> > ==========================================================================================
> > Source                  Subsystem       Consumption             Activation/Frequency
> > ==========================================================================================
> > PROT_NONE faults        NUMAB           NUMAB=1 locality based  While task is running,
> > via process pgtable                     balancing               rate varies on observed
> > walk                                    NUMAB=2 hot page        locality and sysctl knobs.
> >                                         promotion
> > ==========================================================================================
> > folio_mark_accessed()   FS/filemap/GUP  LRU list activation     On cache access and unmap
> > ==========================================================================================
> > PTE A bit via           Reclaim:LRU     LRU list activation,    During memory pressure
> > rmap walk                               deactivation/demotion
> > ==========================================================================================
> > PTE A bit via           Reclaim:MGLRU   LRU list activation,    - During memory pressure
> > rmap walk and process                   deactivation/demotion   - Continuous sampling (configurable)
> > pgtable walk                                                      for workingset reporting
> > ==========================================================================================
> > PTE A bit via           DAMON           LRU activation,
> > rmap walk                               hot page promotion,
> >                                         demotion etc
>
> For virtual address space monitoring mode, DAMON uses PTE A bit via pgtable
> walk.
>
> Its activation and frequency are basically set as the user requests.  Activation
> can be set to be reactive to memory-pressure-like events (using watermarks).
> Frequency can be auto-tuned for pursuing access events per snapshot ratio.

Thanks. I've added that (in very brief form) to the table in my slides.

> > SJ has proposed perhaps extending DAMON as a possible interface layer.
> > I am yet to understand how that works in cases where regions do not provide
> > a compact representation due to lack of contiguity in the hotness.
> > An example use case is a hypervisor wanting to migrate data under unaware,
> > cheap VMs.  After a system has been running for a while (particularly with hot
> > pages being migrated, swap etc) the hotness map looks much like noise.
>
> Similar concerns for DAMON's region abstraction were raised for physical
> address space monitoring, because there is no conscious effort for making hot
> pages gathered together (or, locality).
>
> I'd argue there is no conscious effort to make temperature be spread, though.
> As a result, we can expect a level of unconscious bias, and that matches with my
> experiences from DAMON use cases on product environments so far.

Whilst I'm not in a position to share the data, as it's not mine :(
I've seen graphs that show that for at least some use cases, even if we have
some contiguity of hotness in the VA space, it looks like noise in PA.
So I think this is a case of 'mileage may vary'.  DAMON works great sometimes,
but sometimes the spread of access statistics happens to be wrong.

>
> Also, in practice, DAMON regions are used in combination with other
> information.  For example, DAMON-based reclaim checks the PTE A bit of each page
> in a DAMON-suggested cold memory region to make the final decision about whether
> to reclaim it or not, like MADV_PAGEOUT does.

Makes sense.  The MADV_PAGEOUT case was one of the motivators for the
mixing-methods suggestion.  Here it's kind of DAMON + dense A bit checking
(on candidate pages).

>
> That is, yes, I agree DAMON's region abstraction is maybe not a good way to
> find perfect answer to some questions such as finding N-th hottest single page.
> And it has much room to improve.
> Nevertheless, even DAMON of today can give
> good enough best-effort answers for questions that are practical for some cases,
> such as finding regions that may contain the N hottest/coldest pages, while
> keeping the monitoring overhead fixed as users ask.
>
> Also, please note that there is no reason to restrict DAMON to always use
> regions abstraction.  For different use-cases and situations, DAMON will be open
> to be extended to use new abstractions.  DAMON aims not to be a subsystem for
> DAMON regions concept but data access monitoring for practical efficiency, and
> continue random evolution for given environments.

Absolutely understood.  In my current thinking DAMON sits at a particular layer
in the stack and there may be one more abstraction on top of it (e.g. a list
of hot/cold pages).  Equally possible that the layers may fuse and it becomes
an aspect of DAMON.

>
> >
> > Now for the "there be monsters bit"...
> > ---------------------------------------
> >
> > - Stability of hotness matters and is hard to establish.
> >   Predict a page will remain hot - various heuristics.
> >   a) It is hot, probably stays so? (super hot!)
> >      Sometimes enough to be detected as hot once,
> >      often not.
> >   b) It has been hot a while, probably stays so.
> >      Check this hot list against previous hot list,
> >      entries in both needed to promote.
> >      This has a problem if hotlist is small compared to
> >      total count of hot pages.  Say list is 1%, 20% actually
> >      hot, low chance of repeats even in hot pages.
> >   c) It is hot, let's monitor a while before doing anything.
> >      Measurement technique may change.  Maybe cheaper
> >      to monitor 'candidate' pages than all pages
> >      e.g. CXL HMU gives 1000 pages, then we use access bit
> >      sampling to check they are at least accessed N times
> >      in next second.
> >   d) It was hot, we moved it.  Did it stay hot?
> >      More useful to identify when we are thrashing and should
> >      just stop doing anything.  Too late to fix this one!
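
To put rough numbers on the problem in b) above, here is a minimal sketch.
It assumes each hot list is an independent, uniform sample drawn only from
the truly hot pages, which is an idealisation and not how any real tracker
behaves.  The function name is made up for illustration.

```python
from fractions import Fraction


def repeat_probability(list_fraction, hot_fraction):
    """Chance that a given hot page shows up in one list, and in two in a row.

    Assumption (illustrative only): every list is an independent uniform
    sample of the truly hot pages, so each hot page is equally likely to
    be picked each time.
    """
    per_list = min(Fraction(1), Fraction(list_fraction) / Fraction(hot_fraction))
    return per_list, per_list * per_list


# List holds 1% of all pages, 20% of all pages are actually hot.
per_list, both = repeat_probability(Fraction(1, 100), Fraction(20, 100))
print(f"in one list:   {float(per_list):.2%}")   # 5.00%
print(f"in both lists: {float(both):.2%}")       # 0.25%
```

So under that assumption a genuinely hot page has only a 1 in 400 chance of
being in both lists, which is why requiring a repeat promotes so little when
the list is small relative to the hot set.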
>
> DAMON is providing a sort of b) approach, aka DAMON regions' age, for finding
> both hot and cold regions.
>
> > - Some data should be considered hot even when not in use (e.g. stack)
>
> DAMOS filters are for this kind of exception, and the DAMON kernel API is flexible
> enough to let callers directly manipulate the regions information based on
> their special knowledge.  We can further optimize the interface for easier
> use, of course.

Nice.

>
> > - Usecases interfere.  So it can't just be a broadcast mode
> >   where hotness information is sent to all users.
> > - When to stop, start migration / tracking?
> >   a) Detecting bad decisions.  Enough bad decisions, better to
> >      do nothing?
> >   b) Metadata beyond the counts is useful
> >      https://lore.kernel.org/all/87h64u2xkh.fsf@DESKTOP-5N7EMDA/
> >      Promotion algorithms can need aggregate statistics for a memory
> >      device to decide how much to move.
>
> The DAMOS quota goals feature is a sort of a feature for this question.  It allows
> users to set a target metric and value, and tune the aggressiveness.  For
> promotions and demotions, I suggested using upper tier utilization and free
> ratio as such possible goal metrics, and am going to post an implementation for
> that soon.

Those are certainly good metrics to consider, but I think we definitely also
need a metric around how beneficial the moves being made are.  That matters
more on the promotion path, because that interrupts access to hot data and so
will cause a temporary drop in performance / latency spike.

>
> >
> > As noted above, this may well overlap with other sessions.
> > One outcome of the discussion so far is to highlight what I think many
> > already knew.  This is hard!
>
> Indeed.  Keeping more people on the same page is important and difficult.
> Thank you for your effort again, and looking forward to discussing in more depth!
>

I'm not sure we'll succeed.
This may well be a wild west situation for a while yet, but hopefully we
can slowly converge or at least build some common parts.

Jonathan

p.s. Heathrow disruption means I'm crossing my fingers on actually getting
to Montreal.

>
> Thanks,
> SJ
>
> >
> > Jonathan