Hi Kernel Maintainers,

We are chasing an issue where the slab allocator does not release task_struct slab objects allocated on behalf of cgroups, and we are wondering whether this is a known issue or expected behavior.

If we stress the system and spawn many tasks in different cgroups, the number of active allocated task_struct objects increases, but the kernel never releases that memory later, even when the system goes idle with far fewer running processes.

To test this, we have prepared a bash script that creates 1000 cgroups and spawns 100,000 bash tasks. The full script and its test results are available on github:

https://github.com/saeedsk/slab-allocator-test

Here is a quick snapshot of the test results before and after running multiple concurrent tasks in different cgroups:

------------- system initial statistics -------------
Slab:           419196 kB
SReclaimable:   123788 kB
SUnreclaim:     295408 kB

# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
task_struct      735    990   5888    5    8 : tunables    0    0    0 : slabdata    198    198      0

Number of running processes before starting the test: 334

...... loading 100,000 time-bounded tasks with 1000 mem cgroups ......
...... waiting until all tasks are complete, normally within the next 5 seconds ......

------------- after tasks are loaded and completed running -------------
Slab:           948932 kB
SReclaimable:   125816 kB
SUnreclaim:     823116 kB

# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
task_struct    11404  11665   5888    5    8 : tunables    0    0    0 : slabdata   2333   2333      0

Number of running processes when the test is completed: 334

As shown above, the number of active task_struct slab objects increased from 735 to 11404 during the test. The system keeps 11404 task_struct objects allocated while idle, with only 334 tasks running. This large number of active task_struct slabs is not normal, and a large fraction of that memory could be released back to the system memory pool.

If we write to the slab caches' shrink sysfs entries, the kernel releases the deactivated objects and frees the related memory, but this does not happen automatically as we expected. The following command releases those zombie objects:

# for file in /sys/kernel/slab/*; do echo 1 > $file/shrink; done

We know that some slab caches are supposed to remain allocated until the system really needs that memory. So in one test we tried to consume all available system memory, hoping the kernel would release the memory above, but that did not happen: the out-of-memory killer started killing processes and no memory was released by the kernel slab allocator.

In recent systemd releases, cgroup memory accounting is enabled by default and systemd creates multiple cgroups to run the various software daemons. Although we have called this a stress test, the same situation may arise during normal system boot, where systemd loads and runs many program/daemon instances in different cgroups. The issue only manifests itself when cgroups are actively in use.

I've confirmed that this issue is present in kernel v4.19.66, kernel v5.0.0 (Ubuntu 19.04) and the latest kernel release, v5.3.0.

Any comment or hint would be greatly appreciated.
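For reference, below is a stripped-down sketch of what the test script does; the real script in the github repo above also handles cleanup, timing and process counting, and the cgroup names/paths here are only illustrative (cgroup v1 memory controller assumed, run as root):

#!/bin/bash
# Simplified sketch of the reproducer; not the actual script from the repo.
NUM_CGROUPS=1000
NUM_TASKS=100000

# Create the memory cgroups.
for i in $(seq 1 $NUM_CGROUPS); do
    mkdir -p /sys/fs/cgroup/memory/slabtest-$i
done

# Record task_struct slab usage before the run.
grep task_struct /proc/slabinfo

# Spawn short-lived bash tasks, spreading them across the cgroups.
for i in $(seq 1 $NUM_TASKS); do
    cg=$(( (i % NUM_CGROUPS) + 1 ))
    (
        # Move this subshell into one of the cgroups, then exit shortly after.
        echo $BASHPID > /sys/fs/cgroup/memory/slabtest-$cg/cgroup.procs
        sleep 0.1
    ) &
done
wait

# Compare task_struct slab usage after all tasks have exited.
grep task_struct /proc/slabinfo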
Here is the related kernel configuration used while these tests were run:

$ grep SLAB .config
# CONFIG_SLAB is not set
CONFIG_SLAB_MERGE_DEFAULT=y
# CONFIG_SLAB_FREELIST_RANDOM is not set
# CONFIG_SLAB_FREELIST_HARDENED is not set

$ grep SLUB .config
CONFIG_SLUB_DEBUG=y
# CONFIG_SLUB_MEMCG_SYSFS_ON is not set
CONFIG_SLUB=y
CONFIG_SLUB_CPU_PARTIAL=y
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set

$ grep KMEM .config
CONFIG_MEMCG_KMEM=y
# CONFIG_DEVKMEM is not set
CONFIG_HAVE_DEBUG_KMEMLEAK=y
# CONFIG_DEBUG_KMEMLEAK is not set

Thanks,
Saeed Karimabadi
Cisco Systems Inc.