On Tue, Aug 20, 2024 at 6:00 AM Gregory Price <gourry@xxxxxxxxxx> wrote:
>
> On Tue, Aug 13, 2024 at 09:56:11AM -0700, Yuanchu Xie wrote:
> > This patch series provides workingset reporting of user pages in
> > lruvecs, of which coldness can be tracked by accessed bits and fd
> > references. However, the concept of workingset applies generically to
> > all types of memory, which could be kernel slab caches, discardable
> > userspace caches (databases), or CXL.mem. Therefore, data sources might
> > come from slab shrinkers, device drivers, or the userspace. IMO, the
> > kernel should provide a set of workingset interfaces that should be
> > generic enough to accommodate the various use cases, and be extensible
> > to potential future use cases. The current proposed interfaces are not
> > sufficient in that regard, but I would like to start somewhere, solicit
> > feedback, and iterate.
> >
> ... snip ...
> > Use cases
> > ==========
> > Promotion/Demotion
> > If different mechanisms are used for promotion and demotion, workingset
> > information can help connect the two and avoid pages being migrated back
> > and forth.
> > For example, given a promotion hot page threshold defined in reaccess
> > distance of N seconds (promote pages accessed more often than every N
> > seconds). The threshold N should be set so that ~80% (e.g.) of pages on
> > the fast memory node pass the threshold. This calculation can be done
> > with workingset reports.
> > To be directly useful for promotion policies, the workingset report
> > interfaces need to be extended to report hotness and gather hotness
> > information from the devices[1].
> >
> > [1]
> > https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1
> >
> > Sysfs and Cgroup Interfaces
> > ==========
> > The interfaces are detailed in the patches that introduce them. The main
> > idea here is we break down the workingset per-node per-memcg into time
> > intervals (ms), e.g.
> >
> > 1000 anon=137368 file=24530
> > 20000 anon=34342 file=0
> > 30000 anon=353232 file=333608
> > 40000 anon=407198 file=206052
> > 9223372036854775807 anon=4925624 file=892892
> >
> > I realize this does not generalize well to hotness information, but I
> > lack the intuition for an abstraction that presents hotness in a useful
> > way. Based on a recent proposal for move_phys_pages[2], it seems like
> > userspace tiering software would like to move specific physical pages,
> > instead of informing the kernel "move x number of hot pages to y
> > device". Please advise.
> >
> > [2]
> > https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@xxxxxxxxxxxx/
> >
>
> Just as a note on this work, this is really a testing interface. The
> end-goal is not to merge such an interface that is user-facing like
> move_phys_pages, but instead to have something like a triggered kernel
> task that has a directive of "Promote X pages from Device A".
>
> This work is more of an open collaboration for prototyping such that we
> don't have to plumb it through the kernel from the start and assess the
> usefulness of the hardware hotness collection mechanism.

Understood. I think we previously had this exchange and I forgot to
remove the mentions from the cover letter.

>
> ---
>
> More generally on promotion, I have been considering recently a problem
> with promoting unmapped pagecache pages - since they are not subject to
> NUMA hint faults. I started looking at PG_accessed and PG_workingset as
> a potential mechanism to trigger promotion - but I'm starting to see a
> pattern of competing priorities between reclaim (LRU/MGLRU) logic and
> promotion logic.

In this case, IMO hardware support would be good as it could provide
the kernel with exactly what pages are hot, and it would not care
whether a page is mapped or not. I recall there being some CXL proposal
on this, but I'm not sure whether it has settled into a standard yet.
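As a rough illustration of the threshold calculation mentioned in the
cover letter, the sketch below parses a report in the format quoted
above and picks the smallest interval bound (in ms, whereas the cover
letter states N in seconds) at which a target fraction of pages is
covered. The names parse_report and pick_threshold_ms are hypothetical
and not part of the patch series. Treating each row as a bucket of
pages idle for less than its bound, and summing anon and file pages,
are both assumptions made for illustration. The sample data is the
example from the cover letter.

```python
from typing import List, Tuple

# Example report from the cover letter:
# "<interval bound in ms> anon=<pages> file=<pages>"
REPORT = """\
1000 anon=137368 file=24530
20000 anon=34342 file=0
30000 anon=353232 file=333608
40000 anon=407198 file=206052
9223372036854775807 anon=4925624 file=892892
"""


def parse_report(text: str) -> List[Tuple[int, int]]:
    """Return (bound_ms, pages) per row, summing anon and file pages."""
    buckets = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        bound_ms = int(fields[0])
        counts = dict(field.split("=", 1) for field in fields[1:])
        pages = int(counts["anon"]) + int(counts["file"])
        buckets.append((bound_ms, pages))
    return buckets


def pick_threshold_ms(buckets: List[Tuple[int, int]],
                      target_fraction: float) -> int:
    """Smallest bound whose cumulative page count reaches the target."""
    total = sum(pages for _, pages in buckets)
    if total == 0:
        raise ValueError("report contains no pages")
    cumulative = 0
    for bound_ms, pages in sorted(buckets):
        cumulative += pages
        if cumulative >= target_fraction * total:
            return bound_ms
    raise ValueError("target_fraction must not exceed 1.0")


if __name__ == "__main__":
    buckets = parse_report(REPORT)
    print(pick_threshold_ms(buckets, 0.8))
```

On this sample most pages sit in the final catch-all bucket, so an 80%
target only resolves to the last bound. A real policy would presumably
need finer intervals or a different target to get a usable N.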
>
> Reclaim is triggered largely under memory pressure - which means co-opting
> reclaim logic for promotion is at best logically confusing, and at worst
> likely to introduce regressions. The LRU/MGLRU logic is written largely
> for reclaim, not promotion. This makes hacking promotion in after the
> fact rather dubious - the design choices don't match.
>
> One example: if a page moves from inactive->active (or old->young), we
> could treat this as a page "becoming hot" and mark it for promotion, but
> this potentially punishes pages on the "active/younger" lists which are
> themselves hotter.

To avoid punishing pages on the "young" list, one could insert the page
into a "less young" generation, but it would be difficult to have a
fixed policy for this in the kernel, so it may be best for this to be
configurable via BPF. One could insert the page in the middle of the
active/inactive list, but that would in effect create multiple
generations.

>
> I'm starting to think separate demotion/reclaim and promotion components
> are warranted. This could take the form of a separate kernel worker that
> occasionally gets scheduled to manage a promotion list, or even the
> addition of a PG_promote flag to decouple reclaim and promotion logic
> completely. Separating the structures entirely would be good to allow
> both demotion/reclaim and promotion to occur concurrently (although this
> seems problematic under memory pressure).
>
> Would like to know your thoughts here. If we can decide to segregate
> promotion and demotion logic, it might go a long way to simplify the
> existing interfaces and formalize transactions between the two.

The two systems still have to interact, so separating the two would
essentially create a new policy that decides whether the
demotion/reclaim or the promotion policy is in effect. If promotion
could figure out where to insert the page in terms of generations,
wouldn't that be simpler?
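To make the "insert into a less young generation" idea a bit more
concrete, here is a toy userspace model. It is not kernel code and does
not reflect how MGLRU is implemented; the Generations class, the
place_promoted method, and the policy callbacks are all hypothetical,
with the callback standing in for what a BPF-configurable policy might
decide.

```python
from collections import deque
from typing import Callable, Deque, List, Optional

# Hypothetical policy type: given the number of generations, return the
# index (0 = youngest) at which a newly promoted page is inserted.
PlacementPolicy = Callable[[int], int]


def youngest(num_gens: int) -> int:
    """Always place promoted pages in the youngest generation."""
    return 0


def middle(num_gens: int) -> int:
    """Place promoted pages in a middle, "less young" generation."""
    return num_gens // 2


class Generations:
    """Toy stand-in for a multi-generation list on the fast node."""

    def __init__(self, num_gens: int, policy: PlacementPolicy) -> None:
        self.gens: List[Deque[str]] = [deque() for _ in range(num_gens)]
        self.policy = policy

    def place_promoted(self, page: str) -> int:
        """Insert a promoted page where the policy says; return the index."""
        index = self.policy(len(self.gens))
        index = max(0, min(index, len(self.gens) - 1))  # clamp to range
        self.gens[index].appendleft(page)
        return index

    def reclaim_candidate(self) -> Optional[str]:
        """Pop a page from the oldest non-empty generation."""
        for gen in reversed(self.gens):
            if gen:
                return gen.pop()
        return None
```

With the middle policy, a promoted page is considered for reclaim before
pages already in the youngest generation, so it does not outrank them
on arrival. Swapping in a different callback changes the placement
without touching the reclaim side.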
>
> (also if you're going to LPC, might be worth a chat in person)

I cannot make it to LPC. :( Sadness

Yuanchu