[PATCH v2 00/28] The new cgroup slab memory controller

The existing cgroup slab memory controller is based on the idea of
replicating slab allocator internals for each memory cgroup.
This approach promises a low memory overhead (one pointer per page),
and doesn't add much code to the hot allocation and release paths.
But it has a very serious flaw: it leads to low slab utilization.

Using a drgn* script, I estimated slab utilization on a number of
machines running different production workloads. In most cases it was
between 45% and 65%, and the best number I've seen was around 85%.
Turning kmem accounting off brings it to the high 90s and also brings
back 30-50% of slab memory. It means that the real price of the existing
slab memory controller is way bigger than a pointer per page.

The real reason why the existing design leads to low slab utilization
is simple: slab pages are used exclusively by one memory cgroup.
If there are only a few allocations of a certain size made by a cgroup,
or if some active objects (e.g. dentries) are left after the cgroup is
deleted, or the cgroup contains a single-threaded application which
barely allocates any kernel objects, but does it every time on a new CPU:
in all these cases the resulting slab utilization is very low.
If kmem accounting is off, the kernel is able to use free space
on slab pages for other allocations.
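
A back-of-the-envelope model may help to see the size of the effect. This
is only a sketch with made-up numbers (100 cgroups, 5 objects each, 32
objects per slab page), not a measurement; the function and parameter
names are hypothetical.

```python
import math

def slab_utilization(num_cgroups, objs_per_cgroup, objs_per_page, shared):
    """Return the fraction of slab object slots that are in use.

    Toy model: every cgroup allocates objs_per_cgroup objects of one size
    class, and one slab page holds objs_per_page of them. All names and
    numbers are illustrative assumptions.
    """
    total_objs = num_cgroups * objs_per_cgroup
    if shared:
        # Pages are shared, so objects from all cgroups pack together.
        pages = math.ceil(total_objs / objs_per_page)
    else:
        # Each cgroup gets private pages, each of them possibly mostly empty.
        pages = num_cgroups * math.ceil(objs_per_cgroup / objs_per_page)
    return total_objs / (pages * objs_per_page)

exclusive = slab_utilization(100, 5, 32, shared=False)
shared = slab_utilization(100, 5, 32, shared=True)
print(f"exclusive pages: {exclusive:.1%}, shared pages: {shared:.1%}")
```

In this model the exclusive layout needs one page per cgroup (100 pages),
while the shared layout packs the same 500 objects into 16 pages.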

Arguably it wasn't an issue back in the days when the kmem controller was
introduced and was an opt-in feature, which had to be turned on
individually for each memory cgroup. But now it's turned on by default
on both cgroup v1 and v2. And modern systemd-based systems tend to
create a large number of cgroups.

This patchset provides a new implementation of the slab memory controller,
which aims to achieve much better slab utilization by sharing slab pages
between multiple memory cgroups. Below is a short description of the new
design (more details in commit messages).

Accounting is performed per-object instead of per-page. Slab-related
vmstat counters are converted to bytes. Charging is performed on a per-page
basis, with rounding up and remembering leftovers.
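
The rounding and leftover bookkeeping can be pictured roughly as follows.
This is a simplified sketch, not the code from the patches: the class
name, the 4096-byte page size and the single (not per-CPU) leftover
counter are all assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size in bytes

class ByteCharger:
    """Toy model: byte-granular accounting backed by page-granular charges."""

    def __init__(self):
        self.charged_pages = 0  # whole pages charged to the cgroup
        self.leftover = 0       # pre-charged bytes not used by any object yet

    def charge(self, nbytes):
        if nbytes > self.leftover:
            # Round the shortfall up to whole pages and charge them.
            shortfall = nbytes - self.leftover
            pages = -(-shortfall // PAGE_SIZE)  # ceiling division
            self.charged_pages += pages
            self.leftover += pages * PAGE_SIZE
        self.leftover -= nbytes

    def uncharge(self, nbytes):
        # Freed bytes are remembered; only whole pages are given back.
        self.leftover += nbytes
        pages = self.leftover // PAGE_SIZE
        self.charged_pages -= pages
        self.leftover -= pages * PAGE_SIZE

charger = ByteCharger()
for _ in range(21):
    charger.charge(192)  # 21 objects of 192 bytes fit into one charged page
print(charger.charged_pages, charger.leftover)
```

The invariant kept by the sketch is that charged_pages * PAGE_SIZE always
equals the bytes in use plus the remembered leftover.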

Memcg ownership data is stored in a per-slab-page vector: for each slab page
a vector of corresponding size is allocated. To keep slab memory reparenting
working, instead of saving a pointer to the memory cgroup directly, an
intermediate object is used. It's simply a pointer to a memcg (which can be
easily changed to the parent) with a built-in reference counter. This scheme
makes it possible to reparent all allocated objects without walking over
them and changing each memcg pointer to the parent.
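
As a rough illustration of this indirection, here is a minimal sketch.
The class and function names are hypothetical, and the plain integer
counter stands in for a real reference counter; the actual data
structures are described in the commit messages.

```python
class Memcg:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

class ObjCgroup:
    """Intermediate object: a swappable memcg pointer plus a reference count."""

    def __init__(self, memcg):
        self.memcg = memcg
        self.refcnt = 0

    def get(self):
        self.refcnt += 1
        return self

    def put(self):
        self.refcnt -= 1

class SlabPage:
    """A slab page with one ownership slot per object on the page."""

    def __init__(self, num_objects):
        self.obj_cgroups = [None] * num_objects

    def set_owner(self, index, objcg):
        self.obj_cgroups[index] = objcg.get()

    def clear_owner(self, index):
        self.obj_cgroups[index].put()
        self.obj_cgroups[index] = None

    def owner(self, index):
        return self.obj_cgroups[index].memcg

def reparent(objcg):
    # One pointer update moves every object that references objcg.
    objcg.memcg = objcg.memcg.parent

root = Memcg("root")
child = Memcg("child", parent=root)
objcg = ObjCgroup(child)
page = SlabPage(4)
page.set_owner(0, objcg)
page.set_owner(2, objcg)
reparent(objcg)  # the per-page vector itself is not touched
print(page.owner(0).name, page.owner(2).name, objcg.refcnt)
```

Note that reparent() never iterates over slab pages or objects: both slots
still point to the same intermediate object, which now resolves to the
parent.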

Instead of creating an individual set of kmem_caches for each memory cgroup,
two global sets are used: the root set for non-accounted and root-cgroup
allocations and the second set for all other allocations. This simplifies
the lifetime management of individual kmem_caches: they are destroyed
together with their root counterparts. It also removes a good amount of code
and makes things generally simpler.
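
The selection between the two sets can be sketched like this. It is a toy
model only: the registry class, its method names and the eager creation
of both caches are assumptions for illustration, not the interface
implemented by the patches.

```python
class KmemCaches:
    """Toy registry holding two global sets of caches."""

    def __init__(self):
        self.root_set = {}   # non-accounted and root-cgroup allocations
        self.memcg_set = {}  # all other (accounted) allocations

    def create(self, name):
        # Simplification: both caches are created at the same time.
        self.root_set[name] = f"{name}:root"
        self.memcg_set[name] = f"{name}:memcg"

    def select(self, name, accounted, in_root_cgroup):
        if not accounted or in_root_cgroup:
            return self.root_set[name]
        return self.memcg_set[name]

    def destroy(self, name):
        # The second-set cache goes away with its root counterpart.
        del self.root_set[name]
        del self.memcg_set[name]

caches = KmemCaches()
caches.create("dentry")
print(caches.select("dentry", accounted=True, in_root_cgroup=False))
print(caches.select("dentry", accounted=False, in_root_cgroup=False))
```

With only two sets, the number of caches no longer depends on the number
of memory cgroups in the system.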

The patchset** has been tested on a number of different workloads in our
production environment. In all cases it saved a significant amount of
memory, ranging from high hundreds of MBs to single GBs per host. On
average, the size of slab memory has been reduced by 35-45%.

(** These numbers were obtained using a backport of this patchset to the
kernel version used in fb production. But similar numbers can be obtained on
a vanilla kernel. On my personal desktop with an 8-core CPU and 16 GB of RAM
running Fedora 31 the new slab controller saves ~45-50% of slab memory,
measured right after booting the system).

Additionally, it should lead to lower memory fragmentation, just because
of a smaller number of non-movable pages and also because there is no more
need to move all slab objects to a new set of pages when a workload is
restarted in a new memory cgroup.

The patchset consists of several blocks:
patches (1)-(6) clean up the existing kmem accounting API,
patches (7)-(13) prepare vmstat to count individual slab objects,
patches (14)-(21) implement the main idea of the patchset,
patches (22)-(25) are follow-up clean-ups of the memcg/slab code,
patches (26)-(27) implement a drgn-based replacement for per-memcg slabinfo,
patch (28) adds kselftests covering kernel memory accounting functionality.


* https://github.com/osandov/drgn

v2:
  1) implemented re-layering and renaming suggested by Johannes,
    added his patch to the set. Thanks!
  2) fixed the issue discovered by Bharata B Rao. Thanks!
  3) added kmem API clean up part
  4) added slab/memcg follow-up clean up part
  5) fixed a couple of issues discovered by internal testing on FB fleet.
  6) added kselftests
  7) included metadata into the charge calculation
  8) refreshed commit logs, regrouped patches, rebased onto mm tree, etc

v1:
  1) fixed a bug in zoneinfo_show_print()
  2) added some comments to the subpage charging API, a minor fix
  3) separated memory.kmem.slabinfo deprecation into a separate patch,
     provided a drgn-based replacement
  4) rebased on top of the current mm tree

RFC:
  https://lwn.net/Articles/798605/


Johannes Weiner (1):
  mm: memcontrol: decouple reference counting from page accounting

Roman Gushchin (27):
  mm: kmem: cleanup (__)memcg_kmem_charge_memcg() arguments
  mm: kmem: cleanup memcg_kmem_uncharge_memcg() arguments
  mm: kmem: rename memcg_kmem_(un)charge() into
    memcg_kmem_(un)charge_page()
  mm: kmem: switch to nr_pages in (__)memcg_kmem_charge_memcg()
  mm: memcg/slab: cache page number in memcg_(un)charge_slab()
  mm: kmem: rename (__)memcg_kmem_(un)charge_memcg() to
    __memcg_kmem_(un)charge()
  mm: memcg/slab: introduce mem_cgroup_from_obj()
  mm: fork: fix kernel_stack memcg stats for various stack
    implementations
  mm: memcg/slab: rename __mod_lruvec_slab_state() into
    __mod_lruvec_obj_state()
  mm: memcg: introduce mod_lruvec_memcg_state()
  mm: slub: implement SLUB version of obj_to_index()
  mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  mm: vmstat: convert slab vmstat counter to bytes
  mm: memcg/slab: obj_cgroup API
  mm: memcg/slab: allocate obj_cgroups for non-root slab pages
  mm: memcg/slab: save obj_cgroup for non-root slab objects
  mm: memcg/slab: charge individual slab objects instead of pages
  mm: memcg/slab: deprecate memory.kmem.slabinfo
  mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h
  mm: memcg/slab: use a single set of kmem_caches for all memory cgroups
  mm: memcg/slab: simplify memcg cache creation
  mm: memcg/slab: deprecate memcg_kmem_get_cache()
  mm: memcg/slab: deprecate slab_root_caches
  mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo()
  tools/cgroup: add slabinfo.py tool
  tools/cgroup: make slabinfo.py compatible with new slab controller
  kselftests: cgroup: add kernel memory accounting tests

 drivers/base/node.c                        |  14 +-
 fs/pipe.c                                  |   2 +-
 fs/proc/meminfo.c                          |   4 +-
 include/linux/memcontrol.h                 | 147 ++++-
 include/linux/mm.h                         |  25 +-
 include/linux/mm_types.h                   |   5 +-
 include/linux/mmzone.h                     |  12 +-
 include/linux/slab.h                       |   5 +-
 include/linux/slub_def.h                   |   9 +
 include/linux/vmstat.h                     |   8 +
 kernel/fork.c                              |  13 +-
 kernel/power/snapshot.c                    |   2 +-
 mm/list_lru.c                              |  12 +-
 mm/memcontrol.c                            | 638 +++++++++++++--------
 mm/oom_kill.c                              |   2 +-
 mm/page_alloc.c                            |  12 +-
 mm/slab.c                                  |  36 +-
 mm/slab.h                                  | 346 +++++------
 mm/slab_common.c                           | 513 ++---------------
 mm/slob.c                                  |  12 +-
 mm/slub.c                                  |  62 +-
 mm/vmscan.c                                |   3 +-
 mm/vmstat.c                                |  37 +-
 mm/workingset.c                            |   6 +-
 tools/cgroup/slabinfo.py                   | 220 +++++++
 tools/testing/selftests/cgroup/.gitignore  |   1 +
 tools/testing/selftests/cgroup/Makefile    |   2 +
 tools/testing/selftests/cgroup/test_kmem.c | 380 ++++++++++++
 28 files changed, 1505 insertions(+), 1023 deletions(-)
 create mode 100755 tools/cgroup/slabinfo.py
 create mode 100644 tools/testing/selftests/cgroup/test_kmem.c

-- 
2.24.1




