On Sat, Sep 18, 2021 at 2:56 PM Kari Argillander <kari.argillander@xxxxxxxxx> wrote:
>
> On Tue, Sep 14, 2021 at 03:28:22PM +0800, Muchun Song wrote:
> > We introduced alloc_inode_sb() in the previous version (v2), which sets
> > up the inode reclaim context properly, to allocate filesystem-specific
> > inodes. So we have to convert all filesystems to the new API, which was
> > done in one patch. Some filesystems are easy to convert (just replace
> > kmem_cache_alloc() with alloc_inode_sb()), while other filesystems need
> > to do more work. In order to make it easy for the maintainers of the
> > different filesystems to review their own parts, I split that patch
> > into per-filesystem patches in this version. I am not sure if this is
> > a good idea, because there are going to be more commits.
> >
> > On one of our servers, we found a suspected memory leak: the kmalloc-32
> > slab cache consumes more than 6GB of memory, while every other
> > kmem_cache consumes less than 2GB.
> >
> > After an in-depth analysis, the memory consumption of the kmalloc-32
> > slab cache turned out to come from list_lru_one allocations.
> >
> > crash> p memcg_nr_cache_ids
> > memcg_nr_cache_ids = $2 = 24574
> >
> > memcg_nr_cache_ids is very large, and the memory consumption of each
> > list_lru can be calculated with the following formula:
> >
> >   num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
> >
> > There are 4 NUMA nodes in our system, so each list_lru consumes ~3MB
> > (4 * 24574 * 32 bytes).
> >
> > crash> list super_blocks | wc -l
> > 952
> >
> > Every mount registers 2 list_lrus, one for inodes and another for
> > dentries. There are 952 super_blocks, so the total memory is
> > 952 * 2 * 3MB (~5.6GB). But the current number of memory cgroups is
> > less than 500, so I guess more than 12286 memory cgroups have been
> > created on this machine (I do not know why there are so many cgroups;
> > it may be a user's bug, or the user really wants to do that). Because
> > memcg_nr_cache_ids is never reduced back to a suitable value, a lot of
> > memory is wasted. If we want to reduce memcg_nr_cache_ids, we have to
> > *reboot* the server. This is not what we want.
> >
> > In order to reduce memcg_nr_cache_ids, I had posted a patchset [1] to
> > do this, but it did not fundamentally solve the problem.
> >
> > We currently allocate scope for every memcg to be tracked on every
> > superblock instantiated in the system, regardless of whether that
> > superblock is even accessible to that memcg.
> >
> > These huge memcg counts come from container hosts where each memcg is
> > confined to just a small subset of the superblocks instantiated at any
> > given point in time.
> >
> > For these systems with huge container counts, list_lru does not need
> > the capability of tracking every memcg on every superblock.
> >
> > What it comes down to is that a list_lru is only needed for a given
> > memcg if that memcg is instantiating and freeing objects on that
> > list_lru.
> >
> > As Dave said, "Which makes me think we should be moving more towards
> > 'add the memcg to the list_lru at the first insert' model rather than
> > 'instantiate all at memcg init time just in case'."
> >
> > This patchset aims to optimize the list_lru memory consumption from
> > different aspects.
> >
> > Patches 1-6 are code simplifications.
> > Patch 7 converts the array from per-memcg per-node to per-memcg.
> > Patch 8 introduces kmem_cache_alloc_lru().
> > Patch 9 introduces alloc_inode_sb().
> > Patches 10-66 convert the individual filesystems to alloc_inode_sb().
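
To make patches 8 and 9 concrete for reviewers who have not seen v2:
alloc_inode_sb() is essentially a thin wrapper around the new
kmem_cache_alloc_lru(). A minimal sketch, not the exact patch:

	static inline void *alloc_inode_sb(struct super_block *sb,
					   struct kmem_cache *cache, gfp_t gfp)
	{
		/*
		 * Route the allocation through the superblock so the
		 * memcg's state for sb->s_inode_lru can be set up on
		 * first use, instead of being preallocated for every
		 * memcg at init time.
		 */
		return kmem_cache_alloc_lru(cache, &sb->s_inode_lru, gfp);
	}

This is also why the simple per-filesystem conversions are one-liners:
the filesystem's ->alloc_inode() already receives the super_block, so it
only needs to pass it down to the allocation site.
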
>
> There is nowadays also ntfs3. If you do not plan to convert it, please
> at least CC me so that I can do it when these land.
>
>   Argillander
>

Wow, a new filesystem. I didn't notice it before. I'll cover it in the
next version and Cc you so that you can review it. Thanks for the
reminder.
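
The ntfs3 change should be the simple one-line variant. A rough sketch of
what I expect the patch to look like, assuming ntfs3's ->alloc_inode()
still allocates from its inode cache with GFP_NOFS (treat the exact
identifiers as assumptions until the patch is posted):

	static struct inode *ntfs_alloc_inode(struct super_block *sb)
	{
		struct ntfs_inode *ni;

		/* was: ni = kmem_cache_alloc(ntfs_inode_cachep, GFP_NOFS); */
		ni = alloc_inode_sb(sb, ntfs_inode_cachep, GFP_NOFS);
		if (!ni)
			return NULL;
		/* ... the rest of ntfs3's inode initialization is unchanged ... */
		return &ni->vfs_inode;
	}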