Re: MGLRU premature memcg OOM on slow writes

Johannes Weiner <hannes@xxxxxxxxxxx> · Tue, 12 Mar 2024 23:22:52 -0400

On Tue, Mar 12, 2024 at 10:08:13PM -0400, Yu Zhao wrote:
> On Tue, Mar 12, 2024 at 5:08 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> >
> > On Tue, Mar 12, 2024 at 02:07:04PM -0600, Yu Zhao wrote:
> > > Yes, these two are among the differences between the active/inactive
> > > LRU and MGLRU, but their roles, IMO, are not as important as the LRU
> > > positions of dirty pages. The active/inactive LRU moves dirty pages
> > > all the way to the end of the line (reclaim happens at the front)
> > > whereas MGLRU moves them into the middle, during direct reclaim. The
> > > rationale for MGLRU was that this way those dirty pages would still
> > > be counted as "inactive" (or cold).
> >
> > Note that activating the page is not a statement on the page's
> > hotness. It's simply to park it away from the scanner. We could as
> > well have moved it to the unevictable list - this is just easier.
> >
> > folio_end_writeback() will call folio_rotate_reclaimable() and move it
> > back to the inactive tail, to make it the very next reclaim target as
> > soon as it's clean.
> >
> > > This theory can be quickly verified by comparing how much
> > > nr_vmscan_immediate_reclaim grows, i.e.,
> > >
> > >   Before the copy
> > >     grep nr_vmscan_immediate_reclaim /proc/vmstat
> > >   And then after the copy
> > >     grep nr_vmscan_immediate_reclaim /proc/vmstat
> > >
> > > The growth should be trivial for MGLRU and nontrivial for the
> > > active/inactive LRU.
> > >
> > > If this is indeed the case, I'd appreciate very much if anyone could
> > > try the following (I'll try it myself too later next week).
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 4255619a1a31..020f5d98b9a1 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -4273,10 +4273,13 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > >       }
> > >
> > >       /* waiting for writeback */
> > > -     if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> > > -         (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > > -             gen = folio_inc_gen(lruvec, folio, true);
> > > -             list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > > +     if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > > +             DEFINE_MAX_SEQ(lruvec);
> > > +             int old_gen, new_gen = lru_gen_from_seq(max_seq);
> > > +
> > > +             old_gen = folio_update_gen(folio, new_gen);
> > > +             lru_gen_update_size(lruvec, folio, old_gen, new_gen);
> > > +             list_move(&folio->lru, &lrugen->folios[new_gen][type][zone]);
> > >               return true;
> >
> > Right, because MGLRU sorts these pages out before calling the scanner,
> > so they never get marked for immediate reclaim.
> >
> > But that also implies they won't get rotated back to the tail when
> > writeback finishes.
> 
> Those dirty pages are marked by PG_reclaim either by
> 
>   folio_inc_gen()
>   {
>     ...
>     if (reclaiming)
>       new_flags |= BIT(PG_reclaim);
>     ...
>   }
> 
> or [1], which I missed initially. So they should be rotated on writeback
> finishing up.
> 
> [1] https://lore.kernel.org/linux-mm/ZfC2612ZYwwxpOmR@xxxxxxxxxx/

Ah, I missed that! Thanks.

> > Doesn't that mean that you now have pages that
> >
> > a) came from the oldest generation and were only deferred due to their
> >    writeback state, and
> >
> > b) are now clean and should be reclaimed. But since they're
> >    permanently advanced to the next gen, you'll instead reclaim pages
> >    that were originally ahead of them, and likely hotter.
> >
> > Isn't that an age inversion?
> >
> > Back to the broader question though: if reclaim demand outstrips clean
> > pages and the only viable candidates are dirty ones (e.g. an
> > allocation spike in the presence of dirty/writeback pages), there only
> > seem to be 3 options:
> >
> > 1) sleep-wait for writeback
> > 2) continue scanning, aka busy-wait for writeback + age inversions
> > 3) find nothing and declare OOM
> >
> > Since you're not doing 1), it must be one of the other two, no? One
> > way or another it has to either pace-match to IO completions, or OOM.
> 
> Yes, and in this case, 2) is possible but 3) is very likely.
> 
> MGLRU doesn't do 1) for sure (in the reclaim path of course). I didn't
> find any throttling on dirty pages for cgroup v2 either in the
> active/inactive LRU -- I assume Chris was on v2, and hence my take on
> throttling on dirty pages in the reclaim path not being the key for
> his case.

It's kind of spread out, but it's there:

shrink_folio_list() will bump nr_dirty on dirty pages, and
nr_congested if immediate reclaim folios cycle back around.

shrink_inactive_list() will wake the flushers if all the dirty pages
it encountered are still unqueued.

shrink_node() will set LRUVEC_CGROUP_CONGESTED, and then call
reclaim_throttle() on it. (As Chris points out, though, the throttle
call was not long ago changed from VMSCAN_THROTTLE_WRITEBACK to
VMSCAN_THROTTLE_CONGESTED, and appears a bit more fragile now than it
used to be. Probably worth following up on this.)