On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin wrote: > Memory cgroups are using large chunks of percpu memory to store vmstat > data. Yet this memory is not accounted at all, so in the case when there > are many (dying) cgroups, it's not exactly clear where all the memory is. > > Because the size of memory cgroup internal structures can dramatically > exceed the size of object or page which is pinning it in the memory, it's > not a good idea to simple ignore it. It actually breaks the isolation > between cgroups. > > Let's account the consumed percpu memory to the parent cgroup. > > Signed-off-by: Roman Gushchin <guro@xxxxxx> > Acked-by: Dennis Zhou <dennis@xxxxxxxxxx> Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx> This makes sense, and the accounting is in line with how we track and distribute child creation quotas (cgroup.max.descendants and cgroup.max.depth) up the cgroup tree. I have one minor comment that isn't a dealbreaker for me: > @@ -5069,13 +5069,15 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node) > if (!pn) > return 1; > > - pn->lruvec_stat_local = alloc_percpu(struct lruvec_stat); > + pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat, > + GFP_KERNEL_ACCOUNT); > if (!pn->lruvec_stat_local) { > kfree(pn); > return 1; > } > > - pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat); > + pn->lruvec_stat_cpu = alloc_percpu_gfp(struct lruvec_stat, > + GFP_KERNEL_ACCOUNT); > if (!pn->lruvec_stat_cpu) { > free_percpu(pn->lruvec_stat_local); > kfree(pn); > @@ -5149,11 +5151,13 @@ static struct mem_cgroup *mem_cgroup_alloc(void) > goto fail; > } > > - memcg->vmstats_local = alloc_percpu(struct memcg_vmstats_percpu); > + memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu, > + GFP_KERNEL_ACCOUNT); > if (!memcg->vmstats_local) > goto fail; > > - memcg->vmstats_percpu = alloc_percpu(struct memcg_vmstats_percpu); > + memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu, > + GFP_KERNEL_ACCOUNT); > if (!memcg->vmstats_percpu) > goto fail; > > @@ -5202,7 +5206,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) > struct mem_cgroup *memcg; > long error = -ENOMEM; > > + memalloc_use_memcg(parent); > memcg = mem_cgroup_alloc(); > + memalloc_unuse_memcg(); The disconnect between 1) requesting accounting and 2) which cgroup to charge is making me uneasy. It makes mem_cgroup_alloc() a bit of a handgrenade, because accounting to the current task is almost guaranteed to be wrong if the use_memcg() annotation were to get lost in a refactor or not make it to a new caller of the function. The saving grace is that mem_cgroup_alloc() is pretty unlikely to be used elsewhere. And pretending it's an independent interface would be overengineering. But how about the following in mem_cgroup_alloc() and alloc_mem_cgroup_per_node_info() to document that caller relationship: /* We charge the parent cgroup, never the current task */ WARN_ON_ONCE(!current->active_memcg);