On Fri, Feb 05, 2021 at 04:05:20PM +0100, Michal Hocko wrote:
> On Tue 02-02-21 13:47:45, Johannes Weiner wrote:
> > Replace the memory controller's custom hierarchical stats code with
> > the generic rstat infrastructure provided by the cgroup core.
> >
> > The current implementation does batched upward propagation from the
> > write side (i.e. as stats change). The per-cpu batches introduce an
> > error, which is multiplied by the number of subgroups in a tree. In
> > systems with many CPUs and sizable cgroup trees, the error can be
> > large enough to confuse users (e.g. 32 batch pages * 32 CPUs * 32
> > subgroups results in an error of up to 128M per stat item). This can
> > entirely swallow allocation bursts inside a workload that the user is
> > expecting to see reflected in the statistics.
> >
> > In the past, we've done read-side aggregation, where a memory.stat
> > read would have to walk the entire subtree and add up per-cpu
> > counts. This became problematic with lazily-freed cgroups: we could
> > have large subtrees where most cgroups were entirely idle. Hence the
> > switch to change-driven upward propagation. Unfortunately, it needed
> > to trade accuracy for speed due to the write side being so hot.
> >
> > Rstat combines the best of both worlds: from the write side, it
> > cheaply maintains a queue of cgroups that have pending changes, so
> > that the read side can do selective tree aggregation. This way the
> > reported stats will always be precise and recent as can be, while the
> > aggregation can skip over potentially large numbers of idle cgroups.
> >
> > This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT +
> > NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward
> > aggregation. It removes 3 words from the per-cpu data. It eliminates
> > memcg_exact_page_state(), since memcg_page_state() is now exact.
>
> The above confused me a bit. I can see the pcp data size increased by
> adding _prev. The resulting memory footprint should be increased by
> sizeof(long) * (MEMCG_NR_STAT + NR_VM_EVENT_ITEMS) * (CPUS + 1)
> which is roughly 1kB per CPU per memcg unless I have made any
> mistake. This is a quite a lot and it should be mentioned in the
> changelog.

Not quite, you missed a hunk further below in the patch. Yes, the
_prev arrays are added to the percpu struct. HOWEVER, we used to have
TWO percpu structs in a memcg: one for local data, one for
hierarchical data. In the rstat format, one is enough to capture both:

-	/* Legacy local VM stats and events */
-	struct memcg_vmstats_percpu __percpu *vmstats_local;
-
-	/* Subtree VM stats and events (batched updates) */
 	struct memcg_vmstats_percpu __percpu *vmstats_percpu;

This eliminates dead duplicates of the nr_page_events and
targets[MEM_CGROUP_NTARGETS(2)] we used to carry, which means we have
a net reduction of 3 longs in the percpu data with this series.

> > Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
>
> Although the memory overhead is quite large and it scales both with
> memcg count and CPUs so it can grow quite a bit I do not think this is
> prohibitive. Although it would be really nice if this could be optimized
> in the future.
>
> All that being said, the code looks more manageable now.
> Acked-by: Michal Hocko <mhocko@xxxxxxxx>

Thanks
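
For reference, the 128M figure quoted above is straight multiplication.
A minimal userspace sketch of that worst-case drift, assuming 4K pages;
the 32s are only the example figures from the changelog, not values
taken from any particular config:

#include <stdio.h>

int main(void)
{
	long batch_pages = 32;   /* per-cpu batch size, in pages */
	long page_size = 4096;   /* assuming 4K pages */
	long cpus = 32;
	long subgroups = 32;

	/* each level of batching multiplies the possible drift */
	long max_error = batch_pages * page_size * cpus * subgroups;

	printf("worst-case error per stat item: %ld MiB\n", max_error >> 20);
	return 0;
}

which prints 128 MiB per stat item.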
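
And to spell out the percpu accounting, here is a sketch of the
before/after footprint per memcg and per CPU, counted in longs. The
item counts are hypothetical placeholders (the real MEMCG_NR_STAT and
NR_VM_EVENT_ITEMS depend on the config), and the field layout is
simplified from the hunk and description above, but the delta does not
depend on the placeholder values:

#include <stdio.h>

int main(void)
{
	long nr_stat = 40;     /* stand-in for MEMCG_NR_STAT */
	long nr_events = 100;  /* stand-in for NR_VM_EVENT_ITEMS */
	long extras = 1 + 2;   /* nr_page_events + targets[2] */

	/* before: two percpu structs (local + subtree), extras in each */
	long before = 2 * (nr_stat + nr_events + extras);

	/* after: one percpu struct plus the _prev snapshot arrays */
	long after = (nr_stat + nr_events + extras) + (nr_stat + nr_events);

	printf("percpu delta: %ld longs per memcg per CPU\n", after - before);
	return 0;
}

which comes out to -3, i.e. the net reduction of 3 longs mentioned
above, independent of the actual item counts.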