Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork

Yu Zhao <yuzhao@xxxxxxxxxx> · Sat, 12 Mar 2022 14:11:49 -0700

On Sat, Mar 12, 2022 at 3:37 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
>
> On Sat, Mar 12, 2022 at 12:45 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> >
> > On Fri, Mar 11, 2022 at 3:16 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
> > >
> > > On Tue, Feb 15, 2022 at 10:43 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> > > >
> > > > On Thu, Feb 10, 2022 at 03:41:57PM -0500, Johannes Weiner wrote:
> > > >
> > > > Thanks for reviewing.
> > > >
> > > > > > +static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
> > > > > > +{
> > > > > > +   unsigned long max_seq = lruvec->lrugen.max_seq;
> > > > > > +
> > > > > > +   VM_BUG_ON(gen >= MAX_NR_GENS);
> > > > > > +
> > > > > > +   /* see the comment on MIN_NR_GENS */
> > > > > > +   return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
> > > > > > +}
> > > > >
> > > > > I'm still reading the series, so correct me if I'm wrong: the "active"
> > > > > set is split into two generations for the sole purpose of the
> > > > > second-chance policy for fresh faults, right?
> > > >
> > > > To be precise, the active/inactive notion on top of generations is
> > > > just for ABI compatibility, e.g., the counters in /proc/vmstat.
> > > > Otherwise, this function wouldn't be needed.
> > >
> > > Hi Yu,
> > > I am still quite confused as i am seeing both active/inactive and lru_gen.
> > > eg:
> > >
> > > root@ubuntu:~# cat /proc/vmstat | grep active
> > > nr_zone_inactive_anon 22797
> > > nr_zone_active_anon 578405
> > > nr_zone_inactive_file 0
> > > nr_zone_active_file 4156
> > > nr_inactive_anon 22800
> > > nr_active_anon 578574
> > > nr_inactive_file 0
> > > nr_active_file 4215
> >
> > Yes, this is expected. We have to maintain the ABI, i.e., the
> > *_active/inactive_* counters.
> >
> > > and:
> > >
> > > root@ubuntu:~# cat /sys//kernel/debug/lru_gen
> > >
> > > ...
> > > memcg    36 /user.slice/user-0.slice/user@0.service
> > >  node     0
> > >          20      18820         22           0
> > >          21       7452          0           0
> > >          22       7448          0           0
> > > memcg    33 /user.slice/user-0.slice/user@0.service/app.slice
> > >  node     0
> > >           0    2171452          0           0
> > >           1    2171452          0           0
> > >           2    2171452          0           0
> > >           3    2171452          0           0
> > > memcg    37 /user.slice/user-0.slice/session-1.scope
> > >  node     0
> > >          42      51804     102127           0
> > >          43      18840     275622           0
> > >          44      16104     216805           1
> > >
> > > Does it mean one page could be in both one of the generations and one
> > > of the active/inactive lists?
> >
> > In terms of the data structure, evictable pages are either on
> > lruvec->lists or lrugen->lists.
> >
> > > Do we have some mapping relationship between active/inactive lists
> > > with generations?
> >
> > For the counters, yes -- pages in max_seq and max_seq-1 are counted as
> > active, and the rest are inactive.
> >
> > > We used to put a faulted file page in inactive, if we access it a
> > > second time, it can be promoted
> > > to active. then in recent years, we have also applied this to anon
> > > pages while kernel adds
> > > workingset protection for anon pages. so basically both anon and file
> > > pages go into the inactive
> > > list for the 1st time, if we access it for the second time, they go to
> > > the active list. if we don't access
> > > it any more, they are likely to be reclaimed as they are inactive.
> > > we do have some special fastpath for code section, executable file
> > > pages are kept on active list
> > > as long as they are accessed.
> >
> > Yes.
> >
> > > so all of the above concerns are actually not that correct?
> >
> > They are valid concerns but I don't know any popular workloads that
> > care about them.
>
> Hi Yu,
> here we can get a workload in Kim's patchset while he added workingset
> protection
> for anon pages:
> https://patchwork.kernel.org/project/linux-mm/cover/1581401993-20041-1-git-send-email-iamjoonsoo.kim@xxxxxxx/

Thanks. I wouldn't call that a workload because it's not a real
application. By popular workloads, I mean applications that the
majority of people actually run on phones, in cloud, etc.

> anon pages used to go to active rather than inactive, but kim's patchset
> moved to use inactive first. then only after the anon page is accessed
> second time, it can move to active.

Yes. To clarify, the A-bit doesn't really mean the first or second
access. It can be many accesses each time it's set.

> "In current implementation, newly created or swap-in anonymous page is
>
> started on the active list. Growing the active list results in rebalancing
> active/inactive list so old pages on the active list are demoted to the
> inactive list. Hence, hot page on the active list isn't protected at all.
>
> Following is an example of this situation.
>
> Assume that 50 hot pages on active list and system can contain total
> 100 pages. Numbers denote the number of pages on active/inactive
> list (active | inactive). (h) stands for hot pages and (uo) stands for
> used-once pages.
>
> 1. 50 hot pages on active list
> 50(h) | 0
>
> 2. workload: 50 newly created (used-once) pages
> 50(uo) | 50(h)
>
> 3. workload: another 50 newly created (used-once) pages
> 50(uo) | 50(uo), swap-out 50(h)
>
> As we can see, hot pages are swapped-out and it would cause swap-in later."
>
> Is MGLRU able to avoid the swap-out of the 50 hot pages?

I think the real question is why the 50 hot pages can be moved to the
inactive list. If they are really hot, the A-bit should protect them.

> since MGLRU
> is putting faulted pages to the youngest generation directly, do we have the
> risk mentioned in Kim's patchset?

There are always risks :) I could imagine a thousand ways to make VM
suffer, but all of them could be irrelevant to how it actually does in
production. So a concrete use case of yours would be much appreciated
for this discussion.