On Tue, Aug 21, 2018 at 03:10:52PM -0700, Shakeel Butt wrote:
> On Tue, Aug 21, 2018 at 2:36 PM Roman Gushchin <guro@xxxxxx> wrote:
> >
> > If CONFIG_VMAP_STACK is set, kernel stacks are allocated using
> > __vmalloc_node_range() with __GFP_ACCOUNT, so kernel stack pages
> > are charged against the corresponding memory cgroup on allocation
> > and uncharged when they are released.
> >
> > The problem is that we cache kernel stacks in small per-cpu
> > caches and reuse them for new tasks, which can belong to
> > different memory cgroups.
> >
> > Each stack page still holds a reference to the original cgroup,
> > so the cgroup can't be released until the vmap area is released.
> >
> > For that to happen, we need more than two subsequent exits
> > without forks in between on the current CPU, which makes it very
> > unlikely. As a result, I saw a significant number of dying
> > cgroups (in theory, up to 2 * number_of_cpus + number_of_tasks)
> > which couldn't be released even under significant memory
> > pressure.
> >
> > Since a cgroup structure can occupy a significant amount of
> > memory (first of all, per-cpu data such as memcg statistics),
> > this leads to a noticeable waste of memory.
> >
> > Signed-off-by: Roman Gushchin <guro@xxxxxx>
>
> Reviewed-by: Shakeel Butt <shakeelb@xxxxxxxxxx>
>
> BTW this makes a very good use-case for optimizing kmem uncharging
> similar to what you did for skmem uncharging.

The only thing I'm slightly worried about here is that it can make
the reclaim of memory cgroups harder. It's probably still fine, but
let me first finish the work I'm doing on optimizing the whole memcg
reclaim process, and then I'll return to this case.

Thanks!
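
For readers without the tree handy, the reuse path under discussion
lives in kernel/fork.c. Below is a simplified sketch of the ~v4.18
code, abridged for illustration (error handling, stack accounting and
the !CONFIG_VMAP_STACK variants are dropped); it shows why a cached
stack keeps its original memcg charge, and is not the patch itself:

    #include <linux/percpu.h>
    #include <linux/sched/task_stack.h>
    #include <linux/vmalloc.h>

    #define NR_CACHED_STACKS 2
    static DEFINE_PER_CPU(struct vm_struct *,
                          cached_stacks[NR_CACHED_STACKS]);

    static unsigned long *alloc_thread_stack_node(struct task_struct *tsk,
                                                  int node)
    {
            void *stack;
            int i;

            for (i = 0; i < NR_CACHED_STACKS; i++) {
                    struct vm_struct *s;

                    s = this_cpu_xchg(cached_stacks[i], NULL);
                    if (!s)
                            continue;

                    /* Clear stale pointers from the reused stack. */
                    memset(s->addr, 0, THREAD_SIZE);

                    /*
                     * Cache hit: these pages were charged with
                     * __GFP_ACCOUNT when the vmap area was first
                     * allocated, and each page still points to that
                     * original memcg. The new task may sit in a
                     * different cgroup, but nothing re-charges the
                     * pages here.
                     */
                    tsk->stack_vm_area = s;
                    return s->addr;
            }

            /* Cache miss: allocate and charge the current memcg. */
            stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN,
                                         VMALLOC_START, VMALLOC_END,
                                         THREADINFO_GFP, /* __GFP_ACCOUNT */
                                         PAGE_KERNEL, 0, node,
                                         __builtin_return_address(0));
            if (stack)
                    tsk->stack_vm_area = find_vm_area(stack);
            return stack;
    }

    static void free_thread_stack(struct task_struct *tsk)
    {
            int i;

            for (i = 0; i < NR_CACHED_STACKS; i++) {
                    /*
                     * Stash the vmap area for reuse *without*
                     * uncharging its pages: the exiting task's memcg
                     * stays pinned until the area is eventually
                     * evicted from the cache and vfree'd.
                     */
                    if (this_cpu_cmpxchg(cached_stacks[i], NULL,
                                         tsk->stack_vm_area) != NULL)
                            continue;
                    return;
            }

            vfree_atomic(tsk->stack);
    }

The fix this implies, and roughly what the patch does, is to stop
relying on __GFP_ACCOUNT at vmalloc time: uncharge the stack pages
before they enter the per-cpu cache, and charge them to the new
task's cgroup on every allocation, cached or not.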