On Mon, Apr 05, 2021 at 11:08:26AM -0700, Yang Shi wrote:
> On Sun, Apr 4, 2021 at 10:49 PM Bharata B Rao <bharata@xxxxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > When running 10000 (more-or-less-empty-)containers on a bare-metal Power9
> > server(160 CPUs, 2 NUMA nodes, 256G memory), it is seen that memory
> > consumption increases quite a lot (around 172G) when the containers are
> > running. Most of it comes from slab (149G) and within slab, the majority of
> > it comes from kmalloc-32 cache (102G)
> >
> > The major allocator of kmalloc-32 slab cache happens to be the list_head
> > allocations of list_lru_one list. These lists are created whenever a
> > FS mount happens. Specially two such lists are registered by alloc_super(),
> > one for dentry and another for inode shrinker list. And these lists
> > are created for all possible NUMA nodes and for all given memcgs
> > (memcg_nr_cache_ids to be particular)
> >
> > If,
> >
> > A = Nr allocation request per mount: 2 (one for dentry and inode list)
> > B = Nr NUMA possible nodes
> > C = memcg_nr_cache_ids
> > D = size of each kmalloc-32 object: 32 bytes,
> >
> > then for every mount, the amount of memory consumed by kmalloc-32 slab
> > cache for list_lru creation is A*B*C*D bytes.
>
> Yes, this is exactly what the current implementation does.
>
> >
> > Following factors contribute to the excessive allocations:
> >
> > - Lists are created for possible NUMA nodes.
>
> Yes, because filesystem caches (dentry and inode) are NUMA aware.

True, but creating lists for possible nodes as against online nodes
can hurt platforms where possible is typically higher than online.

>
> > - memcg_nr_cache_ids grows in bulk (see memcg_alloc_cache_id() and additional
> >   list_lrus are created when it grows. Thus we end up creating list_lru_one
> >   list_heads even for those memcgs which are yet to be created.
> >   For example, when 10000 memcgs are created, memcg_nr_cache_ids reach
> >   a value of 12286.
> > - When a memcg goes offline, the list elements are drained to the parent
> >   memcg, but the list_head entry remains.
> > - The lists are destroyed only when the FS is unmounted. So list_heads
> >   for non-existing memcgs remain and continue to contribute to the
> >   kmalloc-32 allocation. This is presumably done for performance
> >   reason as they get reused when new memcgs are created, but they end up
> >   consuming slab memory until then.
>
> The current implementation has list_lrus attached with super_block. So
> the list can't be freed until the super block is unmounted.
>
> I'm looking into consolidating list_lrus more closely with memcgs. It
> means the list_lrus will have the same life cycles as memcgs rather
> than filesystems. This may be able to improve some. But I'm supposed
> the filesystem will be unmounted once the container exits and the
> memcgs will get offlined for your usecase.

Yes, but while the containers are still running, the lists that get
created for non-existing memcgs and non-relevant memcgs are the main
cause of the increased memory consumption.

>
> > - In case of containers, a few file systems get mounted and are specific
> >   to the container namespace and hence to a particular memcg, but we
> >   end up creating lists for all the memcgs.
>
> Yes, because the kernel is *NOT* aware of containers.
>
> > As an example, if 7 FS mounts are done for every container and when
> > 10k containers are created, we end up creating 2*7*12286 list_lru_one
> > lists for each NUMA node. It appears that no elements will get added
> > to other than 2*7=14 of them in the case of containers.
> >
> > One straight forward way to prevent this excessive list_lru_one
> > allocations is to limit the list_lru_one creation only to the
> > relevant memcg. However I don't see an easy way to figure out
> > that relevant memcg from FS mount path (alloc_super())
> >
> > As an alternative approach, I have this below hack that does lazy
> > list_lru creation.
> > The memcg-specific list is created and initialized
> > only when there is a request to add an element to that particular
> > list. Though I am not sure about the full impact of this change
> > on the owners of the lists and also the performance impact of this,
> > the overall savings look good.
>
> It is fine to reduce the memory consumption for your usecase, but I'm
> not sure if this would incur any noticeable overhead for vfs
> operations since list_lru_add() should be called quite often, but it
> just needs to allocate the list for once (for each memcg +
> filesystem), so the overhead might be fine.

Let me run some benchmarks to measure the overhead.
Any particular benchmark suggestion?

>
> And I'm wondering how much memory can be saved for real life workload.
> I don't expect most containers are idle in production environments.

I don't think kmalloc-32 slab cache memory consumption from list_lru
would be any different for real life workload compared to idle containers.

>
> Added some more memcg/list_lru experts in this loop, they may have better ideas.

Thanks.

Regards,
Bharata.
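
For reference, the A*B*C*D estimate quoted earlier in the thread can be checked with a few lines of arithmetic. This is only a rough sketch, not kernel code: it assumes B = 2 (the two NUMA nodes of the test machine; the number of possible nodes may well be higher on other platforms), and the function and constant names are made up for illustration. The "populated" figure is a hypothetical upper bound that assumes only the 2*7=14 lists per container per node ever receive elements, with one kmalloc-32 object each.

```python
# Back-of-the-envelope check of the A*B*C*D estimate from the thread.
A = 2        # list_lru allocations per mount (one dentry list, one inode list)
B = 2        # ASSUMPTION: possible NUMA nodes == the 2 nodes of the test server
C = 12286    # memcg_nr_cache_ids reported after creating 10000 memcgs
D = 32       # size of one kmalloc-32 object, in bytes

MOUNTS_PER_CONTAINER = 7   # from the example in the thread
CONTAINERS = 10_000        # from the example in the thread
GIB = 1024 ** 3


def eager_bytes_per_mount(a=A, b=B, c=C, d=D):
    """kmalloc-32 bytes consumed per mount when every list is created up front."""
    return a * b * c * d


def eager_total_bytes(mounts=MOUNTS_PER_CONTAINER, containers=CONTAINERS):
    """Total across all mounts of all containers (eager creation)."""
    return eager_bytes_per_mount() * mounts * containers


def populated_total_bytes(mounts=MOUNTS_PER_CONTAINER, containers=CONTAINERS):
    """Hypothetical upper bound if only lists that receive elements existed:
    A lists per mount, per node, for the single relevant memcg."""
    return A * B * D * mounts * containers


if __name__ == "__main__":
    print(f"per mount     : {eager_bytes_per_mount()} bytes")
    print(f"eager total   : {eager_total_bytes() / GIB:.1f} GiB")
    print(f"populated only: {populated_total_bytes() / 1024 ** 2:.1f} MiB")
```

Under these assumptions the eager total comes to roughly 102.5 GiB, which lines up with the 102G of kmalloc-32 usage reported above.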