On Wed, Dec 27, 2023 at 11:31 PM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote: > > One of our internal workload regressed on newer upstream kernel and on > further investigation, it seems like the cause is the always synchronous > rstat flush in the count_shadow_nodes() added by the commit f82e6bf9bb9b > ("mm: memcg: use rstat for non-hierarchical stats"). On further > inspection it seems like we don't really need accurate stats in this > function as it was already approximating the amount of appropriate > shadow entried to keep for maintaining the refault information. Since s/entried/entries > there is already 2 sec periodic rstat flush, we don't need exact stats > here. Let's ratelimit the rstat flush in this code path. Is the regression observed even with commit 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing")? I think the answer is yes based on internal discussions, but this really surprises me. Commit f82e6bf9bb9b removed the percpu loop in lruvec_page_state_local(), and added a flush call. With 7d7ef0a4686a, the flush call is only effective if there are pending updates in the cgroup subtree exceeding MEMCG_CHARGE_BATCH * num_online_cpus(). IOW, we should only be doing work when actually needed, whereas before we used to have multiple percpu loops in count_shadow_nodes() regardless of pending updates. It seems like the cgroup subtree is very active that we continuously need to flush in count_shadow_nodes()? If that's the case, do we still think it's okay not to flush when we know there are pending updates? I don't have enough background about the workingset heuristics to judge this. I am not objecting to this change, I am just trying to understand what's happening. Thanks! > > Fixes: f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical stats") > Signed-off-by: Shakeel Butt <shakeelb@xxxxxxxxxx> > --- > mm/workingset.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/workingset.c b/mm/workingset.c > index 2a2a34234df9..226012974328 100644 > --- a/mm/workingset.c > +++ b/mm/workingset.c > @@ -680,7 +680,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker, > struct lruvec *lruvec; > int i; > > - mem_cgroup_flush_stats(sc->memcg); > + mem_cgroup_flush_stats_ratelimited(sc->memcg); > lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid)); > for (pages = 0, i = 0; i < NR_LRU_LISTS; i++) > pages += lruvec_page_state_local(lruvec, > -- > 2.43.0.472.g3155946c3a-goog >