On Thu 19-12-19 20:27:28, Roman Gushchin wrote: > Currently slab percpu vmstats are flushed twice: during the memcg > offlining and just before freeing the memcg structure. Each time > percpu counters are summed, added to the atomic counterparts and > propagated up by the cgroup tree. > > The second flushing is required due to how recursive vmstats are > implemented: counters are batched in percpu variables on a local > level, and once a percpu value is crossing some predefined threshold, > it spills over to atomic values on the local and each ascendant > levels. It means that without flushing some numbers cached in percpu > variables will be dropped on floor each time a cgroup is destroyed. > And with uptime the error on upper levels might become noticeable. > > The first flushing aims to make counters on ancestor levels more > precise. Dying cgroups may resume in the dying state for a long time. > After kmem_cache reparenting which is performed during the offlining > slab counters of the dying cgroup don't have any chances to be > updated, because any slab operations will be performed on the parent > level. It means that the inaccuracy caused by percpu batching > will not decrease up to the final destruction of the cgroup. > By the original idea flushing slab counters during the offlining > should minimize the visible inaccuracy of slab counters on the parent > level. > > The problem is that percpu counters are not zeroed after the first > flushing. So every cached percpu value is summed twice. It creates > a small error (up to 32 pages per cpu, but usually less) which > accumulates on parent cgroup level. After creating and destroying > of thousands of child cgroups, slab counter on parent level can > be way off the real value. > > For now, let's just stop flushing slab counters on memcg offlining. > It can't be done correctly without scheduling a work on each cpu: > reading and zeroing it during css offlining can race with an > asynchronous update, which doesn't expect values to be changed > underneath. > > With this change, slab counters on parent level will become eventually > consistent. Once all dying children are gone, values are correct. > And if not, the error is capped by 32 * NR_CPUS pages per dying > cgroup. > > It's not perfect, as slab are reparented, so any updates after > the reparenting will happen on the parent level. It means that > if a slab page was allocated, a counter on child level was bumped, > then the page was reparented and freed, the annihilation of positive > and negative counter values will not happen until the child cgroup is > released. It makes slab counters different from others, and it might > want us to implement flushing in a correct form again. > But it's also a question of performance: scheduling a work on each > cpu isn't free, and it's an open question if the benefit of having > more accurate counters is worth it. > > We might also consider flushing all counters on offlining, not only > slab counters. > > So let's fix the main problem now: make the slab counters eventually > consistent, so at least the error won't grow with uptime (or more > precisely the number of created and destroyed cgroups). And think > about the accuracy of counters separately. So this is essentially a revert, right? I have to say I was not a great fan of bee07b33db78 in the first place. > v2: added a note to the changelog, asked by Johannes. Thanks! > > Signed-off-by: Roman Gushchin <guro@xxxxxx> > Fixes: bee07b33db78 ("mm: memcontrol: flush percpu slab vmstats on kmem offlining") > Cc: stable@xxxxxxxxxxxxxxx > Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx> Acked-by: Michal Hocko <mhocko@xxxxxxxx> > --- > include/linux/mmzone.h | 5 ++--- > mm/memcontrol.c | 37 +++++++++---------------------------- > 2 files changed, 11 insertions(+), 31 deletions(-) > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 89d8ff06c9ce..5334ad8fc7bd 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -215,9 +215,8 @@ enum node_stat_item { > NR_INACTIVE_FILE, /* " " " " " */ > NR_ACTIVE_FILE, /* " " " " " */ > NR_UNEVICTABLE, /* " " " " " */ > - NR_SLAB_RECLAIMABLE, /* Please do not reorder this item */ > - NR_SLAB_UNRECLAIMABLE, /* and this one without looking at > - * memcg_flush_percpu_vmstats() first. */ > + NR_SLAB_RECLAIMABLE, > + NR_SLAB_UNRECLAIMABLE, > NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */ > NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */ > WORKINGSET_NODES, > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 601405b207fb..3165db39827a 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -3287,49 +3287,34 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css, > } > } > > -static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg, bool slab_only) > +static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg) > { > - unsigned long stat[MEMCG_NR_STAT]; > + unsigned long stat[MEMCG_NR_STAT] = {0}; > struct mem_cgroup *mi; > int node, cpu, i; > - int min_idx, max_idx; > - > - if (slab_only) { > - min_idx = NR_SLAB_RECLAIMABLE; > - max_idx = NR_SLAB_UNRECLAIMABLE; > - } else { > - min_idx = 0; > - max_idx = MEMCG_NR_STAT; > - } > - > - for (i = min_idx; i < max_idx; i++) > - stat[i] = 0; > > for_each_online_cpu(cpu) > - for (i = min_idx; i < max_idx; i++) > + for (i = 0; i < MEMCG_NR_STAT; i++) > stat[i] += per_cpu(memcg->vmstats_percpu->stat[i], cpu); > > for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) > - for (i = min_idx; i < max_idx; i++) > + for (i = 0; i < MEMCG_NR_STAT; i++) > atomic_long_add(stat[i], &mi->vmstats[i]); > > - if (!slab_only) > - max_idx = NR_VM_NODE_STAT_ITEMS; > - > for_each_node(node) { > struct mem_cgroup_per_node *pn = memcg->nodeinfo[node]; > struct mem_cgroup_per_node *pi; > > - for (i = min_idx; i < max_idx; i++) > + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) > stat[i] = 0; > > for_each_online_cpu(cpu) > - for (i = min_idx; i < max_idx; i++) > + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) > stat[i] += per_cpu( > pn->lruvec_stat_cpu->count[i], cpu); > > for (pi = pn; pi; pi = parent_nodeinfo(pi, node)) > - for (i = min_idx; i < max_idx; i++) > + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) > atomic_long_add(stat[i], &pi->lruvec_stat[i]); > } > } > @@ -3403,13 +3388,9 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg) > parent = root_mem_cgroup; > > /* > - * Deactivate and reparent kmem_caches. Then flush percpu > - * slab statistics to have precise values at the parent and > - * all ancestor levels. It's required to keep slab stats > - * accurate after the reparenting of kmem_caches. > + * Deactivate and reparent kmem_caches. > */ > memcg_deactivate_kmem_caches(memcg, parent); > - memcg_flush_percpu_vmstats(memcg, true); > > kmemcg_id = memcg->kmemcg_id; > BUG_ON(kmemcg_id < 0); > @@ -4913,7 +4894,7 @@ static void mem_cgroup_free(struct mem_cgroup *memcg) > * Flush percpu vmstats and vmevents to guarantee the value correctness > * on parent's and all ancestor levels. > */ > - memcg_flush_percpu_vmstats(memcg, false); > + memcg_flush_percpu_vmstats(memcg); > memcg_flush_percpu_vmevents(memcg); > __mem_cgroup_free(memcg); > } > -- > 2.17.1 -- Michal Hocko SUSE Labs