On Tue, Aug 11, 2020 at 11:27:37AM -0400, Johannes Weiner wrote: > On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin wrote: > > Memory cgroups are using large chunks of percpu memory to store vmstat > > data. Yet this memory is not accounted at all, so in the case when there > > are many (dying) cgroups, it's not exactly clear where all the memory is. > > > > Because the size of memory cgroup internal structures can dramatically > > exceed the size of object or page which is pinning it in the memory, it's > > not a good idea to simple ignore it. It actually breaks the isolation > > between cgroups. > > > > Let's account the consumed percpu memory to the parent cgroup. > > > > Signed-off-by: Roman Gushchin <guro@xxxxxx> > > Acked-by: Dennis Zhou <dennis@xxxxxxxxxx> > > Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx> Thank you! > > This makes sense, and the accounting is in line with how we track and > distribute child creation quotas (cgroup.max.descendants and > cgroup.max.depth) up the cgroup tree. > > I have one minor comment that isn't a dealbreaker for me: > > > @@ -5069,13 +5069,15 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node) > > if (!pn) > > return 1; > > > > - pn->lruvec_stat_local = alloc_percpu(struct lruvec_stat); > > + pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat, > > + GFP_KERNEL_ACCOUNT); > > if (!pn->lruvec_stat_local) { > > kfree(pn); > > return 1; > > } > > > > - pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat); > > + pn->lruvec_stat_cpu = alloc_percpu_gfp(struct lruvec_stat, > > + GFP_KERNEL_ACCOUNT); > > if (!pn->lruvec_stat_cpu) { > > free_percpu(pn->lruvec_stat_local); > > kfree(pn); > > @@ -5149,11 +5151,13 @@ static struct mem_cgroup *mem_cgroup_alloc(void) > > goto fail; > > } > > > > - memcg->vmstats_local = alloc_percpu(struct memcg_vmstats_percpu); > > + memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu, > > + GFP_KERNEL_ACCOUNT); > > if (!memcg->vmstats_local) > > goto fail; > > > > - memcg->vmstats_percpu = alloc_percpu(struct memcg_vmstats_percpu); > > + memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu, > > + GFP_KERNEL_ACCOUNT); > > if (!memcg->vmstats_percpu) > > goto fail; > > > > @@ -5202,7 +5206,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) > > struct mem_cgroup *memcg; > > long error = -ENOMEM; > > > > + memalloc_use_memcg(parent); > > memcg = mem_cgroup_alloc(); > > + memalloc_unuse_memcg(); > > The disconnect between 1) requesting accounting and 2) which cgroup to > charge is making me uneasy. It makes mem_cgroup_alloc() a bit of a > handgrenade, because accounting to the current task is almost > guaranteed to be wrong if the use_memcg() annotation were to get lost > in a refactor or not make it to a new caller of the function. > > The saving grace is that mem_cgroup_alloc() is pretty unlikely to be > used elsewhere. And pretending it's an independent interface would be > overengineering. But how about the following in mem_cgroup_alloc() and > alloc_mem_cgroup_per_node_info() to document that caller relationship: > > /* We charge the parent cgroup, never the current task */ > WARN_ON_ONCE(!current->active_memcg); I have nothing against. Andrew, can you, please, squash the following diff into the patch? Thanks! -- diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 130093bdf74b..e25f2db7e61c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5137,6 +5137,9 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node) if (!pn) return 1; + /* We charge the parent cgroup, never the current task */ + WARN_ON_ONCE(!current->active_memcg); + pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat, GFP_KERNEL_ACCOUNT); if (!pn->lruvec_stat_local) { @@ -5219,6 +5222,9 @@ static struct mem_cgroup *mem_cgroup_alloc(void) goto fail; } + /* We charge the parent cgroup, never the current task */ + WARN_ON_ONCE(!current->active_memcg); + memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu, GFP_KERNEL_ACCOUNT); if (!memcg->vmstats_local)