On Fri, Apr 12, 2019 at 11:15:03AM -0400, Johannes Weiner wrote:
> The cgroup memory.stat file holds recursive statistics for the entire
> subtree. The current implementation does this tree walk on-demand
> whenever the file is read. This is giving us problems in production.
>
> 1. The cost of aggregating the statistics on-demand is high. A lot of
> system service cgroups are mostly idle and their stats don't change
> between reads, yet we always have to check them. There are also always
> some lazily-dying cgroups sitting around that are pinned by a handful
> of remaining page cache; the same applies to them.
>
> In an application that periodically monitors memory.stat in our fleet,
> we have seen the aggregation consume up to 5% CPU time.
>
> 2. When cgroups die and disappear from the cgroup tree, so do their
> accumulated vm events. The result is that the event counters at
> higher-level cgroups can go backwards and confuse some of our
> automation, let alone people looking at the graphs over time.
>
> To address both issues, this patch series changes the stat
> implementation to spill counts upwards when the counters change.
>
> The upward spilling is batched using the existing per-cpu cache. In a
> sparse file stress test with 5 level cgroup nesting, the additional
> cost of the flushing was negligible (a little under 1% of CPU at 100%
> CPU utilization, compared to the 5% of reading memory.stat during
> regular operation).
>
>  include/linux/memcontrol.h |  96 +++++++-------
>  mm/memcontrol.c            | 290 +++++++++++++++++++++++++++----------------
>  mm/vmscan.c                |   4 +-
>  mm/workingset.c            |   7 +-
>  4 files changed, 234 insertions(+), 163 deletions(-)
>

For the series:
Reviewed-by: Roman Gushchin <guro@xxxxxx>

Thanks!
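
For reference, the batched upward spilling described in the quoted cover letter can be sketched in userspace C roughly as follows. This is only an illustration of the idea, not the code from the series: the names struct group, group_mod_stat(), group_read_stat(), NR_CPUS and BATCH are hypothetical, the per-cpu cache is simulated with a plain array, and there is no locking or atomics.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS 4   /* assumption: number of simulated CPUs */
#define BATCH   32  /* assumption: spill threshold, chosen for illustration */

/* Hypothetical stand-in for a cgroup with a single recursive counter. */
struct group {
	struct group *parent;
	long total;               /* subtree-wide count, updated by spills */
	long cpu_delta[NR_CPUS];  /* pending per-CPU changes not yet spilled */
};

/*
 * Account a change on one simulated CPU. The change is first cached in
 * the per-CPU delta; once the cached delta exceeds BATCH in either
 * direction, it is added to this group and every ancestor, and the
 * cache is reset.
 */
static void group_mod_stat(struct group *g, int cpu, long val)
{
	struct group *gi;
	long x;

	assert(cpu >= 0 && cpu < NR_CPUS);

	x = g->cpu_delta[cpu] + val;
	if (labs(x) > BATCH) {
		for (gi = g; gi; gi = gi->parent)
			gi->total += x;
		x = 0;
	}
	g->cpu_delta[cpu] = x;
}

/* Reading the recursive count is one load; no subtree walk is needed. */
static long group_read_stat(const struct group *g)
{
	return g->total;
}
```

With a root, child and leaf group chained through parent, 100 increments of 1 on the leaf via CPU 0 spill three batches of 33, so all three levels read 99 and one count stays pending in the leaf's per-CPU cache. Since ancestors already hold the spilled counts, removing the leaf later would not make the root's total go backwards, which is the second problem the cover letter describes.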