On Fri, Apr 12, 2019 at 12:55:10PM -0700, Shakeel Butt wrote:
> On Fri, Apr 12, 2019 at 8:15 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> >
> > Right now, when somebody needs to know the recursive memory statistics
> > and events of a cgroup subtree, they need to walk the entire subtree
> > and sum up the counters manually.
> >
> > There are two issues with this:
> >
> > 1. When a cgroup gets deleted, its stats are lost. The state counters
> > should all be 0 at that point, of course, but the events are not. When
> > this happens, the event counters, which are supposed to be monotonic,
> > can go backwards in the parent cgroups.
> >
>
> We also faced this exact issue and had a similar solution.
>
> > 2. During regular operation, we always have a certain number of lazily
> > freed cgroups sitting around that have been deleted, have no tasks,
> > but have a few cache pages remaining. These groups' statistics do not
> > change until we eventually hit memory pressure, but somebody watching,
> > say, memory.stat on an ancestor has to iterate those every time.
> >
> > This patch addresses both issues by introducing recursive counters at
> > each level that are propagated from the write side when stats change.
> >
> > Upward propagation happens when the per-cpu caches spill over into the
> > local atomic counter. This is the same thing we do during charge and
> > uncharge, except that the latter uses atomic RMWs, which are more
> > expensive; stat changes happen at around the same rate. In a sparse
> > file test (page faults and reclaim at maximum CPU speed) with 5 cgroup
> > nesting levels, perf shows __mod_memcg_page state at ~1%.
> >
>
> (Unrelated to this patchset) I think there should also be a way to get
> the exact memcg stats. As the machines are getting bigger (more cpus
> and larger basic page size) the accuracy of stats is getting worse.
> Internally we have an additional interface memory.stat_exact for that.
> However I am not sure whether, in the upstream kernel, an additional
> interface is better or something like /proc/sys/vm/stat_refresh which
> syncs all per-cpu stats.

I was thinking about eventually consistent counters: sync them
periodically from a worker thread. It should keep the cost of reading
small while improving the accuracy. Would that work for you?
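
To make the idea concrete, here is a rough userspace model in Python, not
kernel code. Everything in it is hypothetical: the LazyCounter name, the
flush() method, the start_sync_worker() helper and the lock that stands in
for real per-cpu data. It only illustrates that writers touch per-cpu
deltas, readers see a possibly stale total, and a periodic worker folds the
deltas into the counter and its ancestors.

```python
import threading


class LazyCounter:
    """Toy model of an eventually consistent, hierarchical stat counter."""

    def __init__(self, nr_cpus, parent=None):
        self.percpu = [0] * nr_cpus    # pending deltas, one slot per simulated CPU
        self.total = 0                 # value readers see; may lag behind writers
        self.parent = parent           # ancestor that also receives flushed deltas
        self.lock = threading.Lock()   # assumption: stands in for per-cpu exclusivity

    def mod(self, cpu, delta):
        # Write side: only the local per-cpu slot is touched.
        with self.lock:
            self.percpu[cpu] += delta

    def read(self):
        # Read side: no iteration over CPUs or descendants.
        return self.total

    def flush(self):
        # Sync step: drain the pending deltas, then propagate them upward.
        with self.lock:
            pending = sum(self.percpu)
            self.percpu = [0] * len(self.percpu)
        node = self
        while node is not None:
            with node.lock:
                node.total += pending
            node = node.parent
        return pending


def start_sync_worker(counters, interval, stop):
    """Flush every counter each `interval` seconds until `stop` is set."""
    def loop():
        # Event.wait() returns False on timeout, True once stop is set.
        while not stop.wait(interval):
            for counter in counters:
                counter.flush()

    worker = threading.Thread(target=loop, daemon=True)
    worker.start()
    return worker
```

In this model the staleness of read() is bounded by the worker interval,
and the cost of a read does not depend on the number of CPUs or cgroups.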