On Tue, Dec 17, 2024 at 08:19:56PM -0800, David Rientjes wrote:
> ----->o-----
> Raghu noted the current promotion destination is node 0 by default. Wei
> noted we could get some page owner information to determine things like
> mempolicies or compute the distance between nodes and, if multiple nodes
> have the same distance, choose one of them just as we do for demotions.
>
> Gregory Price noted some downsides to using mempolicies for this based on
> per-task, per-vma, and cross socket policies, so using the kernel's
> memory tiering policies is probably the best way to go about it.
>

Slightly elaborating here:

- In an async context, associating a page with a specific task is not
  presently possible (that I know of). The most we know is the last
  accessing CPU - maybe - recorded in the page/folio struct, and right
  now that field is disabled in favor of a timestamp when tiering is
  enabled.

  A process with 2 tasks which have access to the page may not run on
  the same socket, so we run the risk of migrating to a bad target.
  Best effort here would suggest either socket is fine - since they're
  both "fast nodes" - but this requires that we record the last
  accessing CPU for a page at identification time.

- Even if we could associate a page with a particular task, the task
  and/or cgroup are not guaranteed to have a socket affinity. Obviously
  if an affinity exists it can be used, but that doesn't satisfy the
  default behavior. Basically, we shouldn't depend on this.

- Per-vma mempolicies are a potential solution, but they're not very
  common in the wild - software would have to become NUMA-aware and
  mbind() particular memory regions. We shouldn't depend on this
  either.

- This holds for future mechanisms like CHMU, whose access data is even
  more abstract (no concept of an accessing task / cpu / owner at all).

More generally: in an async scanning context it's presently not
possible to identify the optimal promotion node - and it likely is not
possible without userland hints.
So we should probably just leverage static configuration data (HMAT)
and some basic math to put together a promotion target in a similar
way to how we calculate a demotion target.

Long-winded way of saying I don't think an optimal solution is
possible, so let's start with suboptimal and get data.

> ----->o-----
> My takeaways:
>
> - there is a definite need to separate hot page detection and the
>   promotion path since hot pages may be derived from multiple sources,
>   including hardware assists in the future
>
> - for the hot page tracking itself, a common abstraction to be used that
>   can effectively describe hotness regardless of the backend it is
>   deriving its information from would likely be quite useful
>

In a synchronous context (accessing task), something like:

    /* numa_node_id() - node of the CPU we're currently running on */
    target_node = numa_node_id();
    promote_pagevec(vec, target_node, PROMOTE_DEFER);

where the promotion logic then does something like:

    promote_batch(pagevec, target);

In an asynchronous context (scanning task), something like:

    promote_pagevec(vec, NUMA_NO_NODE, PROMOTE_DEFER);

where the promotion logic then does something like:

    for each folio in pagevec:
        target = memory_tiers_promotion_target(folio_nid(folio));
        promote(folio, target);

Plumbing-wise this can be optimized to gather similarly-located pages
into a sub-pagevec and use promote_batch() semantics.

My gut says this is the best we're going to get, since async contexts
can't identify accessor locations easily (especially CHMU).

> - I think virtual memory scanning is likely the only viable approach for

Hard disagree. Virtual memory scanning misses an entire class of
memory: unmapped file cache.

https://lore.kernel.org/linux-mm/20241210213744.2968-1-gourry@xxxxxxxxxx/

> this purpose and we could store state in the underlying struct page,

This is contentious. Look at folio->_last_cpupid for context - we're
already overloading fields in subtle ways to steal a 32-bit area.
> similar to NUMA Balancing, but that all scanning should be driven by
> walking the mm_struct's to harvest the Accessed bit
>

If the goal is to do multi-tenant tiering (i.e. many mm_structs), then
this scales poorly by design.

Elsewhere, folks agreed that CXL memory will have HMU-driven hotness
data as the primary mechanism. This is a physical-memory hotness
tracking mechanism that avoids scanning page tables or page structs
entirely. If we think that's the direction this is going, then we
shouldn't invest a ton of effort into a virtual-memory-driven design
as the primary user. (Sure, support it, but don't dive too much
further in.)

> - if there is any general pushback on leveraging a kthread for this,
>   this would be very good feedback to have early
>

I think having one or more kthreads driven by promotion pressure is a
good idea for the promotion system. I'm not sure how well this will
scale for many-process, high-memory systems - covering 1TB+ at 256MB
per scanning interval means ~4096 intervals for a full pass, which is
very low accuracy. Need more data!

~Gregory