Re: [PATCH v3 0/7] mm: workingset reporting

Yuanchu Xie <yuanchu@xxxxxxxxxx> · Mon, 26 Aug 2024 16:43:01 -0700

On Tue, Aug 20, 2024 at 6:00 AM Gregory Price <gourry@xxxxxxxxxx> wrote:
>
> On Tue, Aug 13, 2024 at 09:56:11AM -0700, Yuanchu Xie wrote:
> > This patch series provides workingset reporting of user pages in
> > lruvecs, whose coldness can be tracked by accessed bits and fd
> > references. However, the concept of workingset applies generically to
> > all types of memory, which could be kernel slab caches, discardable
> > userspace caches (databases), or CXL.mem. Therefore, data sources might
> > come from slab shrinkers, device drivers, or the userspace. IMO, the
> > kernel should provide a set of workingset interfaces that should be
> > generic enough to accommodate the various use cases, and be extensible
> > to potential future use cases. The current proposed interfaces are not
> > sufficient in that regard, but I would like to start somewhere, solicit
> > feedback, and iterate.
> >
> ... snip ...
> > Use cases
> > ==========
> > Promotion/Demotion
> > If different mechanisms are used for promotion and demotion, workingset
> > information can help connect the two and avoid pages being migrated back
> > and forth.
> > For example, consider a promotion hot page threshold defined as a
> > reaccess distance of N seconds (promote pages accessed more often than
> > every N seconds). The threshold N should be set so that, e.g., ~80% of
> > pages on the fast memory node pass the threshold. This calculation can
> > be done with workingset reports.
> > To be directly useful for promotion policies, the workingset report
> > interfaces need to be extended to report hotness and gather hotness
> > information from the devices[1].
> >
> > [1]
> > https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1
> >
> > Sysfs and Cgroup Interfaces
> > ==========
> > The interfaces are detailed in the patches that introduce them. The main
> > idea here is we break down the workingset per-node per-memcg into time
> > intervals (ms), e.g.
> >
> > 1000 anon=137368 file=24530
> > 20000 anon=34342 file=0
> > 30000 anon=353232 file=333608
> > 40000 anon=407198 file=206052
> > 9223372036854775807 anon=4925624 file=892892
> >
> > I realize this does not generalize well to hotness information, but I
> > lack the intuition for an abstraction that presents hotness in a useful
> > way. Based on a recent proposal for move_phys_pages[2], it seems like
> > userspace tiering software would like to move specific physical pages,
> > instead of informing the kernel "move x number of hot pages to y
> > device". Please advise.
> >
> > [2]
> > https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@xxxxxxxxxxxx/
> >
>
> Just as a note on this work, this is really a testing interface.  The
> end-goal is not to merge such an interface that is user-facing like
> move_phys_pages, but instead to have something like a triggered kernel
> task that has a directive of "Promote X pages from Device A".
>
> This work is more of an open collaboration for prototyping such that we
> don't have to plumb it through the kernel from the start and assess the
> usefulness of the hardware hotness collection mechanism.

Understood. I think we previously had this exchange and I forgot to
remove the mentions from the cover letter.
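
As an aside, here is a rough userspace sketch of the threshold
calculation from the cover letter, using the report format quoted above.
It rests on assumptions of mine, not on anything the patches guarantee:
each line's counts cover only the interval ending at that line's bound
(not a running total), idle age is used as a stand-in for reaccess
distance, and the helper names parse_report() and
threshold_for_fraction() are made up for illustration.

```python
from typing import List, Tuple

# Example report copied from the cover letter: "<upper bound in ms> anon=<n> file=<n>".
REPORT = """\
1000 anon=137368 file=24530
20000 anon=34342 file=0
30000 anon=353232 file=333608
40000 anon=407198 file=206052
9223372036854775807 anon=4925624 file=892892
"""

Bin = Tuple[int, int, int]  # (upper bound in ms, anon count, file count)


def parse_report(text: str) -> List[Bin]:
    """Parse report lines into bins sorted by upper bound (hypothetical helper)."""
    bins = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        upper_ms = int(fields[0])
        counts = dict(field.split("=", 1) for field in fields[1:])
        bins.append((upper_ms, int(counts.get("anon", 0)), int(counts.get("file", 0))))
    return sorted(bins)


def threshold_for_fraction(bins: List[Bin], fraction: float) -> int:
    """Return the smallest bin bound N such that at least `fraction` of the
    reported memory sits in bins at or below N.

    Assumes per-bin (non-cumulative) counts; the answer can only be as
    precise as the bin boundaries.
    """
    total = sum(anon + file for _, anon, file in bins)
    if total == 0:
        raise ValueError("empty report")
    running = 0
    for upper_ms, anon, file in bins:
        running += anon + file
        if running >= fraction * total:
            return upper_ms
    return bins[-1][0]


if __name__ == "__main__":
    print(threshold_for_fraction(parse_report(REPORT), 0.8))
```

With the example numbers, only about 20% of the reported memory is in
the bins up to 40000 ms, so an 80% target lands in the last, unbounded
bin. In practice the intervals would have to be chosen finer (or
differently) for the result to be a usable N.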

>
> ---
>
> More generally on promotion, I have been considering recently a problem
> with promoting unmapped pagecache pages - since they are not subject to
> NUMA hint faults.  I started looking at PG_accessed and PG_workingset as
> a potential mechanism to trigger promotion - but I'm starting to see a
> pattern of competing priorities between reclaim (LRU/MGLRU) logic and
> promotion logic.

In this case, IMO hardware support would be good as it could provide
the kernel with exactly which pages are hot, and it would not care
whether a page is mapped or not. I recall there being some CXL
proposal on this, but I'm not sure whether it has settled into a
standard yet.

>
> Reclaim is triggered largely under memory pressure - which means co-opting
> reclaim logic for promotion is at best logically confusing, and at worst
> likely to introduce regressions.  The LRU/MGLRU logic is written largely
> for reclaim, not promotion.  This makes hacking promotion in after the
> fact rather dubious - the design choices don't match.
>
> One example: if a page moves from inactive->active (or old->young), we
> could treat this as a page "becoming hot" and mark it for promotion, but
> this potentially punishes pages on the "active/younger" lists which are
> themselves hotter.

To avoid punishing pages on the "young" list, one could insert the
page into a "less young" generation, but it would be difficult to have
a fixed policy for this in the kernel, so it may be best for this to
be configurable via BPF. One could insert the page in the middle of
the active/inactive list, but that would in effect create multiple
generations.
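
To make the "configurable placement" idea concrete, here is a toy
userspace model. It is not kernel code and does not reflect how MGLRU
is implemented; ToyGenerations, the policy functions, and the policy
signature are all hypothetical. The policy callback merely stands in
for whatever a BPF hook might decide.

```python
from collections import deque
from typing import Callable, Deque, List

# Hypothetical policy signature: (number of generations, page's current
# generation index) -> target generation index. Index 0 is the youngest.
PlacementPolicy = Callable[[int, int], int]


def youngest(num_gens: int, current_gen: int) -> int:
    """Always place an accessed page in the youngest generation."""
    return 0


def one_step_younger(num_gens: int, current_gen: int) -> int:
    """Move an accessed page one generation younger ("less young" than the top)."""
    return max(current_gen - 1, 0)


def middle(num_gens: int, current_gen: int) -> int:
    """Place an accessed page no older than the middle generation."""
    return min(current_gen, num_gens // 2)


class ToyGenerations:
    """A list of generations (youngest first) with a pluggable placement policy."""

    def __init__(self, num_gens: int, policy: PlacementPolicy) -> None:
        self.gens: List[Deque[str]] = [deque() for _ in range(num_gens)]
        self.policy = policy

    def add(self, page: str, gen: int) -> None:
        self.gens[gen].append(page)

    def find(self, page: str) -> int:
        for index, gen in enumerate(self.gens):
            if page in gen:
                return index
        raise KeyError(page)

    def on_access(self, page: str) -> int:
        """Re-place an accessed page according to the policy; return its new generation."""
        current = self.find(page)
        target = self.policy(len(self.gens), current)
        target = max(0, min(target, len(self.gens) - 1))  # clamp to a valid index
        self.gens[current].remove(page)
        self.gens[target].append(page)
        return target


if __name__ == "__main__":
    lru = ToyGenerations(4, one_step_younger)
    lru.add("page-a", 3)
    print(lru.on_access("page-a"))
```

The point of the sketch is only that the insertion point is a policy
decision separate from the list structure itself, which is why a fixed
in-kernel choice seems hard to get right for every workload.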

>
> I'm starting to think separate demotion/reclaim and promotion components
> are warranted. This could take the form of a separate kernel worker that
> occasionally gets scheduled to manage a promotion list, or even the
> addition of a PG_promote flag to decouple reclaim and promotion logic
> completely.  Separating the structures entirely would be good to allow
> both demotion/reclaim and promotion to occur concurrently (although this
> seems problematic under memory pressure).
>
> Would like to know your thoughts here.  If we can decide to segregate
> promotion and demotion logic, it might go a long way to simplify the
> existing interfaces and formalize transactions between the two.

The two systems still have to interact, so separating the two would
essentially create a new policy that decides whether the
demotion/reclaim or the promotion policy is in effect. If promotion
could figure out where to insert the page in terms of generations,
wouldn't that be simpler?

>
> (also if you're going to LPC, might be worth a chat in person)

I cannot make it to LPC. :( Sadness

Yuanchu