On Wed, Jul 19, 2023 at 05:46:13PM +0000, Yosry Ahmed wrote:
> Currently, memcg uses rstat to maintain hierarchical stats. The rstat
> framework keeps track of which cgroups have updates on which cpus.
>
> For non-hierarchical stats, as memcg moved to rstat, they are no longer
> readily available as counters. Instead, the percpu counters for a given
> stat need to be summed to get the non-hierarchical stat value. This
> causes a performance regression when reading non-hierarchical stats on
> kernels where memcg moved to using rstat. This is especially visible
> when reading memory.stat on cgroup v1. There are also some code paths
> internal to the kernel that read such non-hierarchical stats.

It's actually not an rstat regression. It's always been this costly.

Quick history:

We used to maintain *all* stats in per-cpu counters at the local
level. memory.stat reads would have to iterate and aggregate the
entire subtree every time. This was obviously very costly, so we added
batched upward propagation during stat updates to simplify reads:

commit 42a300353577ccc17ecc627b8570a89fa1678bec
Author: Johannes Weiner <hannes@xxxxxxxxxxx>
Date:   Tue May 14 15:47:12 2019 -0700

    mm: memcontrol: fix recursive statistics correctness & scalabilty

However, that caused a regression in the stat write path, as the
upward propagation would bottleneck on the cachelines in the shared
parents. The fix for *that* re-introduced the per-cpu loops in the
local stat reads:

commit 815744d75152078cde5391fc1e3c2d4424323fb6
Author: Johannes Weiner <hannes@xxxxxxxxxxx>
Date:   Thu Jun 13 15:55:46 2019 -0700

    mm: memcontrol: don't batch updates of local VM stats and events

So I wouldn't say it's a regression from rstat. Except for that short
period between the two commits above, the read side for local stats
was always expensive. rstat promises a shot at finally fixing it, with
less risk to the write path.

> It is inefficient to iterate and sum counters in all cpus when the rstat
> framework knows exactly when a percpu counter has an update. Instead,
> maintain cpu-aggregated non-hierarchical counters for each stat. During
> an rstat flush, keep those updated as well. When reading
> non-hierarchical stats, we no longer need to iterate cpus, we just need
> to read the maintained counters, similar to hierarchical stats.
>
> A caveat is that we now need a stats flush before reading
> local/non-hierarchical stats through {memcg/lruvec}_page_state_local()
> or memcg_events_local(), where we previously only needed a flush to
> read hierarchical stats. Most contexts reading non-hierarchical stats
> are already doing a flush; add a flush to the only missing context in
> count_shadow_nodes().
>
> With this patch, reading memory.stat from 1000 memcgs is 3x faster on a
> machine with 256 cpus on cgroup v1:
>  # for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done
>  # time cat /dev/cgroup/memory/cg*/memory.stat > /dev/null
>
>  real   0m0.125s
>  user   0m0.005s
>  sys    0m0.120s
>
> After:
>  real   0m0.032s
>  user   0m0.005s
>  sys    0m0.027s
>
> Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>

Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>

But I want to be clear: this isn't a regression fix. It's a new
performance optimization for the deprecated cgroup1 code. And it comes
at the cost of higher memory footprint for both cgroup1 AND cgroup2.
If this causes a regression, we should revert it again. But let's try.
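
To illustrate the bookkeeping the changelog above describes, here is a
minimal userspace sketch (invented names, not the actual memcontrol.c
implementation): a cpu-aggregated "local" value sits next to the
per-cpu deltas, the flush folds the deltas into it, and the local read
becomes a single load instead of a loop over all CPUs.

	#include <stdio.h>

	#define NR_CPUS 4

	/* hypothetical stand-in for one memcg stat counter */
	struct stat_counter {
		long pcpu_delta[NR_CPUS];	/* pending per-cpu updates */
		long local;			/* cpu-aggregated, non-hierarchical */
		long hierarchical;		/* value propagated up the tree */
	};

	/* update path: touches only the updating cpu's slot */
	static void stat_add(struct stat_counter *c, int cpu, long val)
	{
		c->pcpu_delta[cpu] += val;
	}

	/* flush path: fold each cpu's delta into both aggregates */
	static void stat_flush(struct stat_counter *c)
	{
		for (int cpu = 0; cpu < NR_CPUS; cpu++) {
			long delta = c->pcpu_delta[cpu];

			if (!delta)
				continue;
			c->pcpu_delta[cpu] = 0;
			c->local += delta;		/* the counter added here */
			c->hierarchical += delta;	/* would also go to ancestors */
		}
	}

	/* read path: no per-cpu iteration needed anymore */
	static long stat_read_local(struct stat_counter *c)
	{
		return c->local;
	}

	int main(void)
	{
		struct stat_counter c = { 0 };

		stat_add(&c, 0, 5);
		stat_add(&c, 2, 3);
		stat_flush(&c);
		printf("local = %ld\n", stat_read_local(&c));	/* prints 8 */
		return 0;
	}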