Hi Kernel Maintainers,

We are chasing an issue where the slab allocator does not release task_struct slab objects allocated on behalf of cgroups, and we are wondering whether this is a known issue or expected behavior.

If we stress the system and spawn many tasks in different cgroups, the number of active allocated task_struct objects increases, but the kernel never releases that memory later, even when the system goes idle with far fewer running processes.

To test this, we have prepared a bash script that creates 1000 cgroups and spawns 100,000 bash tasks. The full script and its test results are available on github:

https://github.com/saeedsk/slab-allocator-test

Here is a quick snapshot of the test results before and after running multiple concurrent tasks in different cgroups:

------------- system initial statistics -------------
Slab:           419196 kB
SReclaimable:   123788 kB
SUnreclaim:     295408 kB

# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
task_struct      735    990   5888    5    8 : tunables    0    0    0 : slabdata    198    198      0

Number of running processes before starting the test: 334

...... loading 100,000 time-bounded tasks with 1000 mem cgroups ......
...... waiting until all tasks are complete, normally within the next 5 seconds ......

------------- after tasks are loaded and completed running -------------
Slab:           948932 kB
SReclaimable:   125816 kB
SUnreclaim:     823116 kB

# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
task_struct    11404  11665   5888    5    8 : tunables    0    0    0 : slabdata   2333   2333      0

Number of running processes when the test is completed: 334

As shown above, the number of active task_struct slab objects increased from 735 to 11404 during the test. The system keeps 11404 task_struct objects allocated while idle, with only 334 tasks running. This large number of active task_struct slabs is not normal, and a large fraction of that memory could be released back to the system memory pool.

If we write to the slab caches' shrink sysfs entries, the kernel releases the deactivated objects and frees the related memory, but this does not happen automatically as we expected. The following command releases those zombie objects:

# for file in /sys/kernel/slab/*; do echo 1 > $file/shrink; done

We know that some slab caches are supposed to remain allocated until the system really needs that memory. So in one test we tried to consume all available system memory, hoping the kernel would release the memory above, but that did not happen: the out-of-memory killer started killing processes and no memory was released by the kernel slab allocator.

In recent systemd releases, cgroup memory accounting is enabled by default and systemd creates multiple cgroups to run the various software daemons. Although we have called this a stress test, the same situation may arise during normal system boot, where systemd loads and runs many program/daemon instances in different cgroups. The issue only manifests itself when cgroups are actively in use.

I've confirmed that this issue is present in kernel v4.19.66, kernel v5.0.0 (Ubuntu 19.04) and the latest kernel release, v5.3.0.

Any comment or hint would be greatly appreciated.
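For reference, below is a stripped-down sketch of what the test script does; the real script in the github repo above also handles cleanup, timing and process counting, and the cgroup names/paths here are only illustrative (cgroup v1 memory controller assumed, run as root):

#!/bin/bash
# Simplified sketch of the reproducer; not the actual script from the repo.
NUM_CGROUPS=1000
NUM_TASKS=100000

# Create the memory cgroups.
for i in $(seq 1 $NUM_CGROUPS); do
    mkdir -p /sys/fs/cgroup/memory/slabtest-$i
done

# Record task_struct slab usage before the run.
grep task_struct /proc/slabinfo

# Spawn short-lived bash tasks, spreading them across the cgroups.
for i in $(seq 1 $NUM_TASKS); do
    cg=$(( (i % NUM_CGROUPS) + 1 ))
    (
        # Move this subshell into one of the cgroups, then exit shortly after.
        echo $BASHPID > /sys/fs/cgroup/memory/slabtest-$cg/cgroup.procs
        sleep 0.1
    ) &
done
wait

# Compare task_struct slab usage after all tasks have exited.
grep task_struct /proc/slabinfo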
Here is the related kernel configuration used while these tests were run:

$ grep SLAB .config
# CONFIG_SLAB is not set
CONFIG_SLAB_MERGE_DEFAULT=y
# CONFIG_SLAB_FREELIST_RANDOM is not set
# CONFIG_SLAB_FREELIST_HARDENED is not set

$ grep SLUB .config
CONFIG_SLUB_DEBUG=y
# CONFIG_SLUB_MEMCG_SYSFS_ON is not set
CONFIG_SLUB=y
CONFIG_SLUB_CPU_PARTIAL=y
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set

$ grep KMEM .config
CONFIG_MEMCG_KMEM=y
# CONFIG_DEVKMEM is not set
CONFIG_HAVE_DEBUG_KMEMLEAK=y
# CONFIG_DEBUG_KMEMLEAK is not set

Thanks,
Saeed Karimabadi
Cisco Systems Inc.