[..]
> > diff --git a/mm/workingset.c b/mm/workingset.c
> > index dce41577a49d2..7d3dacab8451a 100644
> > --- a/mm/workingset.c
> > +++ b/mm/workingset.c
> > @@ -464,8 +464,12 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
> >
> >  	rcu_read_unlock();
> >
> > -	/* Flush stats (and potentially sleep) outside the RCU read section */
> > -	mem_cgroup_flush_stats_ratelimited();
> > +	/*
> > +	 * Flush stats (and potentially sleep) outside the RCU read section.
> > +	 * XXX: With per-memcg flushing and thresholding, is ratelimiting
> > +	 * still needed here?
> > +	 */
> > +	mem_cgroup_flush_stats_ratelimited(eviction_memcg);
>
> What if flushing is not rate-limited (e.g. above line is commented)?

Hmm, I think I might be misunderstanding the question. The call to
mem_cgroup_flush_stats_ratelimited() does not ratelimit other flushers;
it is rather a flush call that is itself ratelimited. IOW, it may or may
not flush, depending on when someone else last flushed.

This was introduced because flushing in the fault path was expensive in
some cases, so we wanted to avoid flushing if someone else recently did
a flush, as we don't expect a lot of pending changes in this case.

However, that was when flushing was always done at the root level. Now
that we are flushing at the memcg level, it may no longer be needed as:
- The flush is more scoped; there should be less work to do.
- There is a per-memcg threshold now, such that we only flush when there
  are pending updates in this memcg.

This is why I added a comment that the ratelimited flush here may no
longer be needed. I didn't want to investigate this as part of this
series, especially since I do not have a reproducer for the fault
latency that motivated ratelimiting the flush in the first place. Hence,
I am leaving the comment so that people know that this ratelimiting may
no longer be needed with this patch.
> >
> >  	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
> >  	refault = atomic_long_read(&eviction_lruvec->nonresident_age);
> > @@ -676,7 +680,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> >  	struct lruvec *lruvec;
> >  	int i;
> >
> > -	mem_cgroup_flush_stats();
> > +	mem_cgroup_flush_stats(sc->memcg);
> >  	lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
> >  	for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
> >  		pages += lruvec_page_state_local(lruvec,
>
> Confused...

Which part is confusing? The call to mem_cgroup_flush_stats() now
receives a memcg argument, because flushing is scoped to that memcg
only. This avoids the unnecessary work of flushing all other memcgs,
which a global flush would do.