On Fri, Apr 30, 2021 at 10:49:03AM +1000, Dave Chinner wrote:
> On Wed, Apr 28, 2021 at 05:49:40PM +0800, Muchun Song wrote:
> > In our server, we found a suspected memory leak problem. The kmalloc-32
> > consumes more than 6GB of memory. Other kmem_caches consume less than
> > 2GB of memory.
> >
> > After our in-depth analysis, the memory consumption of the kmalloc-32
> > slab cache is caused by list_lru_one allocations.
> >
> > crash> p memcg_nr_cache_ids
> > memcg_nr_cache_ids = $2 = 24574
> >
> > memcg_nr_cache_ids is very large, and the memory consumption of each
> > list_lru can be calculated with the following formula.
> >
> > num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
> >
> > There are 4 NUMA nodes in our system, so each list_lru consumes ~3MB.
> >
> > crash> list super_blocks | wc -l
> > 952
>
> The more I see people trying to work around this, the more I think
> that the way memcgs have been grafted into the list_lru is back to
> front.
>
> We currently allocate scope for every memcg to be able to be tracked
> on every superblock instantiated in the system, regardless of whether
> that superblock is even accessible to that memcg.
>
> These huge memcg counts come from container hosts where memcgs are
> confined to just a small subset of the total number of superblocks
> that are instantiated at any given point in time.
>
> IOWs, for these systems with huge container counts, list_lru does
> not need the capability of tracking every memcg on every superblock.
>
> What it comes down to is that the list_lru is only needed for a
> given memcg if that memcg is instantiating and freeing objects on a
> given list_lru.
>
> Which makes me think we should be moving more towards an "add the
> memcg to the list_lru at the first insert" model rather than an
> "instantiate all at memcg init time just in case" model. The model we
> originally came up with for supporting memcgs is really starting to
> show its limits, and we should address those limitations rather than
> hack more complexity into the system that does nothing to remove the
> limitations that are causing the problems in the first place.

I totally agree.

It looks like the initial implementation of the whole kernel memory
accounting and memcg-aware shrinkers was based on the idea that the
number of memory cgroups is relatively small and stable. With systemd
creating a separate cgroup for everything, including short-lived
processes, that is simply not true anymore.

Thanks!
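
To make the numbers concrete: with 4 NUMA nodes and
memcg_nr_cache_ids = 24574, each list_lru costs 4 * 24574 * 32 bytes
~= 3MB, and with 952 superblocks (each carrying at least a dentry and
an inode lru) that lands in the ballpark of the >6GB of kmalloc-32
reported above, even though most memcgs never touch most superblocks.

Below is a rough userspace sketch of the "add the memcg to the
list_lru at the first insert" model, just to illustrate the idea; it
is not actual kernel code, and the names (lazy_lru, lazy_lru_add) are
made up for the example. The per-memcg lru state is only allocated the
first time that memcg inserts an object, so the allocation cost scales
with the memcgs that actually use a given lru rather than with
memcg_nr_cache_ids:

#include <stdlib.h>

/* Doubly-linked circular list node, in the style of struct list_head. */
struct list_node {
        struct list_node *prev, *next;
};

/* Per-memcg lru state; this is the kmalloc-32-sized object that is
 * currently pre-allocated for every memcg id on every list_lru. */
struct lru_one {
        struct list_node list;          /* circular list of objects */
        long nr_items;
};

struct lazy_lru {
        struct lru_one **memcg_lists;   /* indexed by memcg id; NULL
                                         * until that memcg's first
                                         * insert */
        int nr_memcgs;
};

static int lazy_lru_init(struct lazy_lru *lru, int nr_memcgs)
{
        /* Only a pointer array up front; no per-memcg list heads yet. */
        lru->memcg_lists = calloc(nr_memcgs, sizeof(*lru->memcg_lists));
        if (!lru->memcg_lists)
                return -1;
        lru->nr_memcgs = nr_memcgs;
        return 0;
}

static int lazy_lru_add(struct lazy_lru *lru, int memcg_id,
                        struct list_node *item)
{
        struct lru_one *l = lru->memcg_lists[memcg_id];

        if (!l) {
                /* First insert from this memcg: allocate its list now. */
                l = calloc(1, sizeof(*l));
                if (!l)
                        return -1;
                l->list.prev = l->list.next = &l->list;
                lru->memcg_lists[memcg_id] = l;
        }

        /* Link the item in at the head of the circular list. */
        item->next = l->list.next;
        item->prev = &l->list;
        l->list.next->prev = item;
        l->list.next = item;
        l->nr_items++;
        return 0;
}

In real kernel code the NULL check and allocation would of course have
to happen under the lru lock (or use a cmpxchg) to avoid racing
inserts, and the per-node dimension is dropped here for brevity, but
the shape is the same: memory is only committed for (memcg, lru) pairs
that are actually used.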