On Fri, Jun 07, 2024 at 03:48:06PM GMT, Jesper Dangaard Brouer wrote:
> From: Shakeel Butt <shakeelb@xxxxxxxxxx>
>
> commit d4a5b369ad6d8aae552752ff438dddde653a72ec upstream.
>
> One of our workloads (Postgres 14 + sysbench OLTP) regressed on a newer
> upstream kernel and, on further investigation, the cause seems to be the
> always-synchronous rstat flush in count_shadow_nodes() added by commit
> f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical stats"). On
> further inspection, we don't really need accurate stats in this
> function, as it was already approximating the number of shadow entries
> to keep for maintaining the refault information. Since there is already
> a 2 sec periodic rstat flush, we don't need exact stats here. Let's
> ratelimit the rstat flush in this code path.
>
> Link: https://lkml.kernel.org/r/20231228073055.4046430-1-shakeelb@xxxxxxxxxx
> Fixes: f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical stats")
> Signed-off-by: Shakeel Butt <shakeelb@xxxxxxxxxx>
> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> Cc: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> Cc: Yu Zhao <yuzhao@xxxxxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxx>
> Cc: Roman Gushchin <roman.gushchin@xxxxxxxxx>
> Cc: Muchun Song <songmuchun@xxxxxxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
>
> ---
> In production on kernel v6.6 we are observing excessive cgroup rstat
> flushing due to the extra call to mem_cgroup_flush_stats() in
> count_shadow_nodes(), introduced in commit f82e6bf9bb9b ("mm: memcg:
> use rstat for non-hierarchical stats"); that commit is part of v6.6.
> We request a backport of commit d4a5b369ad6d ("mm: ratelimit stat flush
> from workingset shrinker"), as it carries a Fixes tag for that commit.
>
> IMHO it is worth explaining the call path that makes count_shadow_nodes()
> cause excessive cgroup rstat flushing. shrink_node() first calls
> mem_cgroup_flush_stats() on its own and then invokes
> shrink_node_memcgs(). shrink_node_memcgs() iterates over cgroups via
> mem_cgroup_iter(), calling shrink_slab() for each. shrink_slab() calls
> do_shrink_slab(), which via shrinker->count_objects() invokes
> count_shadow_nodes(), and count_shadow_nodes() then does yet another
> mem_cgroup_flush_stats() call, which seems unnecessary.

Actually, in production at Meta we have also replaced the
mem_cgroup_flush_stats() call in shrink_node() with
mem_cgroup_flush_stats_ratelimited(), as it too was causing excessive
flushing. We have not observed any issues after the change. I will
propose that patch upstream as well.
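To make it concrete, both changes amount to switching a single flush call
to the ratelimited helper. A rough sketch of the two hunks follows (file
paths and context lines are from my reading of the v6.6 sources, not the
verbatim patches):

    --- a/mm/workingset.c	(commit d4a5b369ad6d, the requested backport)
    +++ b/mm/workingset.c
    @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker, ...)
    -		mem_cgroup_flush_stats();
    +		mem_cgroup_flush_stats_ratelimited();

    --- a/mm/vmscan.c	(the additional change we carry at Meta)
    +++ b/mm/vmscan.c
    @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
    -	mem_cgroup_flush_stats();
    +	mem_cgroup_flush_stats_ratelimited();

If I read the v6.6 helper correctly, mem_cgroup_flush_stats_ratelimited()
skips the flush unless the periodic (2 sec) flush is overdue, so the
heuristic in count_shadow_nodes() still sees reasonably fresh stats
without paying for a synchronous flush on every per-memcg shrinker
invocation.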