The existing cgroup slab memory controller is based on the idea of
replicating slab allocator internals for each memory cgroup.
This approach promises a low memory overhead (one pointer per page),
and doesn't add too much code on hot allocation and release paths.
But it has a very serious flaw: it leads to low slab utilization.

Using a drgn* script I obtained an estimate of slab utilization on
a number of machines running different production workloads. In most
cases it was between 45% and 65%, and the best number I've seen was
around 85%. Turning kmem accounting off brings it to the high 90s. It
also brings back 30-50% of slab memory. This means that the real price
of the existing slab memory controller is much higher than a pointer
per page.

The real reason why the existing design leads to low slab utilization
is simple: slab pages are used exclusively by one memory cgroup.
If a cgroup makes only a few allocations of a certain size,
or if some active objects (e.g. dentries) are left after the cgroup is
deleted, or if the cgroup contains a single-threaded application which
barely allocates any kernel objects but does so on a new CPU every
time: in all these cases the resulting slab utilization is very low.
If kmem accounting is off, the kernel is able to use free space
on slab pages for other allocations.

Arguably it wasn't an issue back in the days when the kmem controller
was introduced and was an opt-in feature, which had to be turned on
individually for each memory cgroup. But now it's turned on by default
on both cgroup v1 and v2, and modern systemd-based systems tend to
create a large number of cgroups.

This patchset provides a new implementation of the slab memory
controller, which aims to reach much better slab utilization by sharing
slab pages between multiple memory cgroups. Below is a short description
of the new design (more details in commit messages).

Accounting is performed per-object instead of per-page.
Slab-related vmstat counters are converted to bytes. Charging is
performed on a page basis, with rounding up and remembering leftovers.

Memcg ownership data is stored in a per-slab-page vector: for each slab
page a vector of corresponding size is allocated. To keep slab memory
reparenting working, an intermediate object is used instead of saving
a pointer to the memory cgroup directly. It's simply a pointer to
a memcg (which can be easily changed to the parent) with a built-in
reference counter. This scheme allows reparenting all allocated objects
without walking over them and changing the memcg pointer of each one
to the parent.

Instead of creating an individual set of kmem_caches for each memory
cgroup, two global sets are used: the root set for non-accounted and
root-cgroup allocations and the second set for all other allocations.
This simplifies the lifetime management of individual kmem_caches:
they are destroyed with their root counterparts. It also removes a good
amount of code and makes things generally simpler.

The patchset* has been tested on a number of different workloads in our
production. In all cases it saved a significant amount of memory,
measured from high hundreds of MBs to single GBs per host. On average,
the size of slab memory has been reduced by 35-45%.

(* These numbers were obtained using a backport of this patchset to the
kernel version used in FB production. But similar numbers can be
obtained on a vanilla kernel. On my personal desktop with an 8-core CPU
and 16 GB of RAM running Fedora 31 the new slab controller saves ~45-50%
of slab memory, measured just after the system boots).

Additionally, it should lead to lower memory fragmentation, simply
because of a smaller number of non-movable pages and also because there
is no more need to move all slab objects to a new set of pages when
a workload is restarted in a new memory cgroup.
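
To make the ownership scheme described above more concrete, here is a toy
userspace model in Python. It is only a sketch and not the kernel code:
the class names (Memcg, ObjCgroup, SlabPage), the 4 KiB PAGE_SIZE and the
exact leftover policy are assumptions made for illustration. It shows two
cgroups sharing one slab page, page-basis charging with rounding up and
remembered leftovers, and reparenting done by changing a single pointer.

```python
from dataclasses import dataclass
from typing import List, Optional

PAGE_SIZE = 4096  # assumption: 4 KiB pages


@dataclass
class Memcg:
    """Toy memory cgroup (hypothetical): a name, a parent, a page counter."""
    name: str
    parent: Optional["Memcg"] = None
    charged_pages: int = 0


@dataclass
class ObjCgroup:
    """Intermediate object: a swappable memcg pointer with a reference count.

    leftover_bytes is the part of already charged pages that is not yet used.
    """
    memcg: Memcg
    refcount: int = 0
    leftover_bytes: int = 0

    def charge(self, size: int) -> None:
        # Consume remembered leftovers first; charge whole pages (rounded up)
        # only for the part that the leftovers do not cover.
        if size > self.leftover_bytes:
            missing = size - self.leftover_bytes
            pages = -(-missing // PAGE_SIZE)  # ceiling division
            self.memcg.charged_pages += pages
            self.leftover_bytes += pages * PAGE_SIZE
        self.leftover_bytes -= size

    def reparent(self) -> None:
        # A single pointer update; no object is visited.
        # Charge counters are not migrated in this toy model.
        if self.memcg.parent is not None:
            self.memcg = self.memcg.parent


class SlabPage:
    """A slab page shared by several cgroups: one ownership slot per object."""

    def __init__(self, nr_objects: int) -> None:
        self.owners: List[Optional[ObjCgroup]] = [None] * nr_objects

    def alloc(self, index: int, objcg: ObjCgroup, size: int) -> None:
        objcg.charge(size)
        objcg.refcount += 1  # every live object pins its ObjCgroup
        self.owners[index] = objcg

    def free(self, index: int) -> None:
        # Uncharging is omitted for brevity; only the reference is dropped.
        objcg = self.owners[index]
        assert objcg is not None, "slot is not allocated"
        objcg.refcount -= 1
        self.owners[index] = None


def owner_names(page: SlabPage) -> List[Optional[str]]:
    return [o.memcg.name if o else None for o in page.owners]


parent = Memcg("parent")
child = Memcg("child", parent=parent)
other = Memcg("other")
child_objcg = ObjCgroup(child)
other_objcg = ObjCgroup(other)

page = SlabPage(nr_objects=4)
page.alloc(0, child_objcg, 192)  # charges one page, 3904 bytes left over
page.alloc(1, other_objcg, 192)  # a different cgroup on the same slab page
page.alloc(2, child_objcg, 192)  # served from the leftover, no new charge

owners_before = owner_names(page)
child_objcg.reparent()           # e.g. when "child" is deleted
owners_after = owner_names(page)
print(owners_before, "->", owners_after)
```

In this model the two 192-byte objects of "child" cost one charged page,
and after reparent() both of them resolve to "parent" even though the
per-page vector itself was never touched.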

The patchset consists of several blocks:
patches (1)-(6) clean up the existing kmem accounting API,
patches (7)-(13) prepare vmstat to count individual slab objects,
patches (14)-(21) implement the main idea of the patchset,
patches (22)-(25) are follow-up clean-ups of the memcg/slab code,
patches (26)-(27) implement a drgn-based replacement for per-memcg slabinfo,
patch (28) adds kselftests covering kernel memory accounting functionality.

* https://github.com/osandov/drgn


v2:
  1) implemented re-layering and renaming suggested by Johannes,
     added his patch to the set. Thanks!
  2) fixed the issue discovered by Bharata B Rao. Thanks!
  3) added kmem API clean up part
  4) added slab/memcg follow-up clean up part
  5) fixed a couple of issues discovered by internal testing on FB fleet.
  6) added kselftests
  7) included metadata into the charge calculation
  8) refreshed commit logs, regrouped patches, rebased onto mm tree, etc

v1:
  1) fixed a bug in zoneinfo_show_print()
  2) added some comments to the subpage charging API, a minor fix
  3) separated memory.kmem.slabinfo deprecation into a separate patch,
     provided a drgn-based replacement
  4) rebased on top of the current mm tree

RFC:
  https://lwn.net/Articles/798605/


Johannes Weiner (1):
  mm: memcontrol: decouple reference counting from page accounting

Roman Gushchin (27):
  mm: kmem: cleanup (__)memcg_kmem_charge_memcg() arguments
  mm: kmem: cleanup memcg_kmem_uncharge_memcg() arguments
  mm: kmem: rename memcg_kmem_(un)charge() into
    memcg_kmem_(un)charge_page()
  mm: kmem: switch to nr_pages in (__)memcg_kmem_charge_memcg()
  mm: memcg/slab: cache page number in memcg_(un)charge_slab()
  mm: kmem: rename (__)memcg_kmem_(un)charge_memcg() to
    __memcg_kmem_(un)charge()
  mm: memcg/slab: introduce mem_cgroup_from_obj()
  mm: fork: fix kernel_stack memcg stats for various stack
    implementations
  mm: memcg/slab: rename __mod_lruvec_slab_state() into
    __mod_lruvec_obj_state()
  mm: memcg: introduce mod_lruvec_memcg_state()
  mm: slub: implement SLUB version of obj_to_index()
  mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  mm: vmstat: convert slab vmstat counter to bytes
  mm: memcg/slab: obj_cgroup API
  mm: memcg/slab: allocate obj_cgroups for non-root slab pages
  mm: memcg/slab: save obj_cgroup for non-root slab objects
  mm: memcg/slab: charge individual slab objects instead of pages
  mm: memcg/slab: deprecate memory.kmem.slabinfo
  mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h
  mm: memcg/slab: use a single set of kmem_caches for all memory cgroups
  mm: memcg/slab: simplify memcg cache creation
  mm: memcg/slab: deprecate memcg_kmem_get_cache()
  mm: memcg/slab: deprecate slab_root_caches
  mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo()
  tools/cgroup: add slabinfo.py tool
  tools/cgroup: make slabinfo.py compatible with new slab controller
  kselftests: cgroup: add kernel memory accounting tests

 drivers/base/node.c                        |  14 +-
 fs/pipe.c                                  |   2 +-
 fs/proc/meminfo.c                          |   4 +-
 include/linux/memcontrol.h                 | 147 ++++-
 include/linux/mm.h                         |  25 +-
 include/linux/mm_types.h                   |   5 +-
 include/linux/mmzone.h                     |  12 +-
 include/linux/slab.h                       |   5 +-
 include/linux/slub_def.h                   |   9 +
 include/linux/vmstat.h                     |   8 +
 kernel/fork.c                              |  13 +-
 kernel/power/snapshot.c                    |   2 +-
 mm/list_lru.c                              |  12 +-
 mm/memcontrol.c                            | 638 +++++++++++++--------
 mm/oom_kill.c                              |   2 +-
 mm/page_alloc.c                            |  12 +-
 mm/slab.c                                  |  36 +-
 mm/slab.h                                  | 346 +++++------
 mm/slab_common.c                           | 513 ++---------------
 mm/slob.c                                  |  12 +-
 mm/slub.c                                  |  62 +-
 mm/vmscan.c                                |   3 +-
 mm/vmstat.c                                |  37 +-
 mm/workingset.c                            |   6 +-
 tools/cgroup/slabinfo.py                   | 220 +++++++
 tools/testing/selftests/cgroup/.gitignore  |   1 +
 tools/testing/selftests/cgroup/Makefile    |   2 +
 tools/testing/selftests/cgroup/test_kmem.c | 380 ++++++++++++
 28 files changed, 1505 insertions(+), 1023 deletions(-)
 create mode 100755 tools/cgroup/slabinfo.py
 create mode 100644 tools/testing/selftests/cgroup/test_kmem.c

-- 
2.24.1