On Fri, Apr 30, 2021 at 10:49:03AM +1000, Dave Chinner wrote:
> On Wed, Apr 28, 2021 at 05:49:40PM +0800, Muchun Song wrote:
> > In our server, we found a suspected memory leak problem. The kmalloc-32
> > consumes more than 6GB of memory. Other kmem_caches consume less than
> > 2GB of memory.
> >
> > After our in-depth analysis, the memory consumption of the kmalloc-32
> > slab cache is caused by list_lru_one allocations.
> >
> > crash> p memcg_nr_cache_ids
> > memcg_nr_cache_ids = $2 = 24574
> >
> > memcg_nr_cache_ids is very large, and the memory consumption of each
> > list_lru can be calculated with the following formula.
> >
> > num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
> >
> > There are 4 NUMA nodes in our system, so each list_lru consumes ~3MB.
> >
> > crash> list super_blocks | wc -l
> > 952
>
> The more I see people trying to work around this, the more I think
> that the way memcgs have been grafted into the list_lru is back to
> front.
>
> We currently allocate scope for every memcg to be able to be tracked
> on every superblock instantiated in the system, regardless of whether
> that superblock is even accessible to that memcg.
>
> These huge memcg counts come from container hosts where memcgs are
> confined to just a small subset of the total number of superblocks
> that are instantiated at any given point in time.
>
> IOWs, for these systems with huge container counts, list_lru does
> not need the capability of tracking every memcg on every superblock.
>
> What it comes down to is that the list_lru is only needed for a
> given memcg if that memcg is instantiating and freeing objects on a
> given list_lru.
>
> Which makes me think we should be moving more towards an "add the
> memcg to the list_lru at the first insert" model rather than an
> "instantiate all at memcg init time just in case" model. The model we
> originally came up with for supporting memcgs is really starting to
> show its limits, and we should address those limitations rather than
> hack more complexity into the system that does nothing to remove the
> limitations that are causing the problems in the first place.

I totally agree.

It looks like the initial implementation of the whole kernel memory
accounting and memcg-aware shrinkers was based on the idea that the
number of memory cgroups is relatively small and stable. With systemd
creating a separate cgroup for everything, including short-lived
processes, that is simply not true anymore.

Thanks!
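
To make the numbers concrete: with 4 NUMA nodes and
memcg_nr_cache_ids = 24574, each list_lru costs 4 * 24574 * 32 bytes
~= 3MB, and with 952 superblocks (each carrying at least a dentry and
an inode lru) that lands in the ballpark of the >6GB of kmalloc-32
reported above, even though most memcgs never touch most superblocks.

Below is a rough userspace sketch of the "add the memcg to the
list_lru at the first insert" model, just to illustrate the idea; it
is not actual kernel code, and the names (lazy_lru, lazy_lru_add) are
made up for the example. The per-memcg lru state is only allocated the
first time that memcg inserts an object, so the allocation cost scales
with the memcgs that actually use a given lru rather than with
memcg_nr_cache_ids:

#include <stdlib.h>

/* Doubly-linked circular list node, in the style of struct list_head. */
struct list_node {
        struct list_node *prev, *next;
};

/* Per-memcg lru state; this is the kmalloc-32-sized object that is
 * currently pre-allocated for every memcg id on every list_lru. */
struct lru_one {
        struct list_node list;          /* circular list of objects */
        long nr_items;
};

struct lazy_lru {
        struct lru_one **memcg_lists;   /* indexed by memcg id; NULL
                                         * until that memcg's first
                                         * insert */
        int nr_memcgs;
};

static int lazy_lru_init(struct lazy_lru *lru, int nr_memcgs)
{
        /* Only a pointer array up front; no per-memcg list heads yet. */
        lru->memcg_lists = calloc(nr_memcgs, sizeof(*lru->memcg_lists));
        if (!lru->memcg_lists)
                return -1;
        lru->nr_memcgs = nr_memcgs;
        return 0;
}

static int lazy_lru_add(struct lazy_lru *lru, int memcg_id,
                        struct list_node *item)
{
        struct lru_one *l = lru->memcg_lists[memcg_id];

        if (!l) {
                /* First insert from this memcg: allocate its list now. */
                l = calloc(1, sizeof(*l));
                if (!l)
                        return -1;
                l->list.prev = l->list.next = &l->list;
                lru->memcg_lists[memcg_id] = l;
        }

        /* Link the item in at the head of the circular list. */
        item->next = l->list.next;
        item->prev = &l->list;
        l->list.next->prev = item;
        l->list.next = item;
        l->nr_items++;
        return 0;
}

In real kernel code the NULL check and allocation would of course have
to happen under the lru lock (or use a cmpxchg) to avoid racing
inserts, and the per-node dimension is dropped here for brevity, but
the shape is the same: memory is only committed for (memcg, lru) pairs
that are actually used.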