The current design of rstat takes the approach that if one subsystem is to be flushed, all other subsystem controllers with pending updates should also be flushed. It seems that over time, the stat-keeping of some subsystems has grown in size to the extent that it noticeably slows down the others. This has been most observable in situations where the memory controller is enabled. One big area where the issue comes up is system telemetry, where programs periodically sample cpu stats. Programs like this would benefit if the overhead of flushing memory stats (and others) could be eliminated. It would save cpu cycles for existing cpu-based telemetry programs and improve scalability in terms of sampling frequency and the number of hosts the programs run on - the cpu cycles saved help free up the budget for cycles to be spent on the desired stats.

This series changes the approach from "flush all subsystems" to "flush only the requested subsystem". The core design change is moving from a single unified rstat tree to separate trees: one for each enabled subsystem controller that implements css_rstat_flush(), one for the base stats (cgroup::self) subsystem state, and one dedicated to bpf-based cgroups (if enabled).

A layer of indirection was introduced into rstat. Where a cgroup reference was previously used as a parameter, the updated/flush families of functions now accept a reference to a new cgroup_rstat struct along with a new interface containing ops that perform type-specific pointer offsets for accessing common types. The ops allow the rstat routines to work only with common types while hiding away any unique types. Together, the new struct and interface allow the set of entities that can participate in rstat to be extended. In this series, cgroup_subsys_state and cgroup_bpf become participants. For both of these structs, the cgroup_rstat struct was added as a new field and ops were statically defined for each type in order to provide access to related objects. To illustrate, cgroup_subsys_state was given an rstat_struct field and one of its ops was defined to return a pointer to the rstat struct of its parent.

Public APIs were changed as well. So that clients can be specific about which stats are being updated or flushed, a reference to the given cgroup_subsys_state is now passed instead of the cgroup. For the bpf APIs, a cgroup is still passed as an argument since there is no subsystem state associated with these custom cgroups; however, the names of those API calls were changed and now carry a "bpf_" prefix. A rough sketch of the new types and entry points is shown below.

Since separate trees are in use, the locking scheme was adjusted to prevent contention between them. Separate locks exist for the three categories: base stats (cgroup::self), formal subsystem controllers (memory, io, etc), and bpf-based cgroups. Where applicable, the functions for lock management were adjusted to accept the lock as a parameter instead of using a global.

Breaking up the unified tree into separate trees eliminates the overhead and scalability issue explained in the first section, but comes at the expense of additional memory. In an effort to minimize this overhead, a conditional allocation is performed. The cgroup_rstat_cpu struct originally contained both the rstat list pointers and the base stat entities; it was reduced to contain only the list pointers.
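To make the above more concrete, here is a rough sketch of the kinds of types and entry points described, including the reduced per-cpu struct just mentioned. The field names, op names, and exact signatures below are assumptions drawn from this description (plus standard kernel helpers like container_of()), not necessarily the exact code in the series:

/*
 * Sketch only - names and layout are illustrative.
 */

/* Per-cpu node, now reduced to just the rstat list pointers. */
struct cgroup_rstat_cpu {
	struct cgroup_rstat *updated_children;	/* terminated by self */
	struct cgroup_rstat *updated_next;	/* NULL when not on a list */
};

/* Embedded in cgroup_subsys_state and in cgroup_bpf. */
struct cgroup_rstat {
	struct cgroup_rstat_cpu __percpu *rstat_cpu;
};

/* Type-specific accessors so the rstat core only deals in common types. */
struct cgroup_rstat_ops {
	struct cgroup_rstat *(*parent_fn)(struct cgroup_rstat *rstat);
	struct cgroup *(*cgroup_fn)(struct cgroup_rstat *rstat);
};

/*
 * Example op for the css case: recover the owning css via container_of()
 * and hand back the parent's embedded rstat struct (the field name
 * "rstat" is assumed here).
 */
static struct cgroup_rstat *css_rstat_parent(struct cgroup_rstat *rstat)
{
	struct cgroup_subsys_state *css =
		container_of(rstat, struct cgroup_subsys_state, rstat);

	return css->parent ? &css->parent->rstat : NULL;
}

/*
 * Updated/flush entry points: callers name the subsystem state they care
 * about; the bpf variants still take a cgroup but gain a "bpf_" prefix.
 */
void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu);
void cgroup_rstat_flush(struct cgroup_subsys_state *css);
void bpf_cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
void bpf_cgroup_rstat_flush(struct cgroup *cgrp);

With accessors like these, the core update/flush paths can walk any of the separate trees through cgroup_rstat without needing to know whether the owner is a css or a cgroup_bpf.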
For the single case where base stats participate in rstat, a new struct cgroup_rstat_base_cpu was created; it contains both the list pointers and the base stat entities. The conditional allocation is done only when the cgroup::self subsys_state is initialized. Since the list pointers exist at the beginning of both cgroup_rstat_cpu and cgroup_rstat_base_cpu, a union is used to access one type or the other depending on whether cgroup::self is detected; it is the one subsystem state whose subsystem pointer is NULL.

With this change, the total per-cpu memory overhead is:

	nr_cgroups * (sizeof(struct cgroup_rstat_base_cpu) +
		      sizeof(struct cgroup_rstat_cpu) * nr_controllers)

or, if bpf-based cgroups are enabled:

	nr_cgroups * (sizeof(struct cgroup_rstat_base_cpu) +
		      sizeof(struct cgroup_rstat_cpu) * (nr_controllers + 1))

... where nr_controllers is the number of enabled cgroup controllers that implement css_rstat_flush().

With regard to validation, there is a measurable benefit when reading a specific set of stats. Using the cpu stats as the basis for flushing, some experiments were set up to measure the perf and time differences.

The first experiment consisted of a parent cgroup with memory.swap.max=0 and memory.max=1G. On a 52-cpu machine, 26 child cgroups were created, and within each child cgroup a process was spawned to encourage updates to the memory cgroup stats by creating and then reading a file of size 1T (encouraging reclaim). These 26 tasks ran in parallel. While this was going on, a custom program was used to open the cpu.stat file of the parent cgroup, read the entire file 1M times, then close it (a minimal sketch of such a reader appears after the results below). On the control side, the perf report for the reading task showed that most of the cycles (42%) were spent in mem_cgroup_css_rstat_flush(), with a smaller but still significant number of cycles spent in __blkcg_rstat_flush(). The perf report for the patched kernel showed no cycles spent in either of these functions; instead, most cycles were spent in cgroup_base_stat_flush(). Aside from the perf reports, the time spent by the program reading cpu.stat also showed a gain on the experimental kernel compared to the control - the time in kernel mode was reduced.

before:
real	0m18.449s
user	0m0.209s
sys	0m18.165s

after:
real	0m6.080s
user	0m0.170s
sys	0m5.890s

Another experiment on the same host used a parent cgroup with two child cgroups, with the same swap and memory limits as the previous experiment. In the two child cgroups, kernel builds were run in parallel, each using "-j 20". The program from the previous experiment was again used to perform 1M reads of the parent cpu.stat file. The perf comparison showed results similar to the previous experiment: on the control side, a majority of cycles (42%) were spent in mem_cgroup_css_rstat_flush() and a significant number in __blkcg_rstat_flush(); on the experimental side, most cycles were spent in cgroup_base_stat_flush() and no cycles were spent flushing memory or io. The time taken by the program reading cpu.stat is shown below.

before:
real	0m17.223s
user	0m0.259s
sys	0m16.871s

after:
real	0m6.498s
user	0m0.237s
sys	0m6.220s
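For reference, the cpu.stat reader used above is not included in this series; a minimal stand-in might look like the sketch below. Only the "open once, read the whole file 1M times, close" behavior is taken from the description; the cgroup path is a placeholder.

/* cpu.stat reader sketch - path and buffer size are placeholders */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int fd, i;

	fd = open("/sys/fs/cgroup/parent/cpu.stat", O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	for (i = 0; i < 1000000; i++) {
		/* each pass re-reads the file, triggering an rstat flush */
		if (lseek(fd, 0, SEEK_SET) < 0 ||
		    read(fd, buf, sizeof(buf)) < 0) {
			perror("read");
			break;
		}
	}

	close(fd);
	return 0;
}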
For the final experiment, perf events were recorded during a kernel build with the same host and cgroup setup. The builds took place in the child node. The control and experimental sides both showed a similar number of cycles spent in cgroup_rstat_updated(), and in both cases the count appeared insignificant compared to the other events recorded with the workload.

JP Kobryn (11):
  cgroup: move rstat pointers into struct of their own
  cgroup: add level of indirection for cgroup_rstat struct
  cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state
  cgroup: introduce cgroup_rstat_ops
  cgroup: separate rstat for bpf cgroups
  cgroup: rstat lock indirection
  cgroup: fetch cpu-specific lock in rstat cpu lock helpers
  cgroup: rstat cpu lock indirection
  cgroup: separate rstat locks for bpf cgroups
  cgroup: separate rstat locks for subsystems
  cgroup: separate rstat list pointers from base stats

 block/blk-cgroup.c                            |   4 +-
 include/linux/bpf-cgroup-defs.h               |   3 +
 include/linux/cgroup-defs.h                   |  98 +--
 include/linux/cgroup.h                        |  11 +-
 include/linux/cgroup_rstat.h                  |  97 +++
 kernel/bpf/cgroup.c                           |   6 +
 kernel/cgroup/cgroup-internal.h               |   9 +-
 kernel/cgroup/cgroup.c                        |  65 +-
 kernel/cgroup/rstat.c                         | 556 +++++++++++++-----
 mm/memcontrol.c                               |   4 +-
 .../selftests/bpf/progs/btf_type_tag_percpu.c |   5 +-
 .../bpf/progs/cgroup_hierarchical_stats.c     |   6 +-
 12 files changed, 594 insertions(+), 270 deletions(-)
 create mode 100644 include/linux/cgroup_rstat.h

--
2.43.5