From: JP Kobryn <inwardvessel@xxxxxxxxx>

The current design of rstat takes the approach that if one subsystem is
to be flushed, all other subsystems with pending updates should also be
flushed. Over time, the stat-keeping of some subsystems has grown in size
to the extent that it noticeably slows down others. This has been most
observable when the memory controller is enabled.

One big area where the issue comes up is system telemetry, where programs
periodically sample cpu stats. Programs like this would benefit if the
overhead of having to flush memory stats (and others) could be
eliminated. It would save cpu cycles for existing cpu-based telemetry
programs and improve scalability in terms of sampling frequency and
volume of hosts.

This series changes the approach from "flush all subsystems" to "flush
only the requested subsystem". The core design change is moving from a
single unified rstat tree of cgroups to separate trees made up of
cgroup_subsys_state's: one (per-cpu) tree for the base stats
(cgroup::self) and one for each enabled subsystem that implements
css_rstat_flush().

To make this possible, the rstat list pointers were moved off of the
cgroup and onto the css, and in the transition their type was changed to
cgroup_subsys_state. Rstat trees are now made up of css nodes, and a
given tree contains only css nodes associated with a specific subsystem.
The rstat APIs were changed to accept a reference to a
cgroup_subsys_state instead of a cgroup, so callers can be specific about
which stats are being updated/flushed.

Since separate trees are now in use, the locking scheme was adjusted. The
global locks were split so that there are separate locks for the base
stats (cgroup::self) and each subsystem (memory, io, etc). This allows
different subsystems (including base stats) to use rstat in parallel with
no contention. A rough sketch of such a split is shown after the size
comparison below.

Breaking up the unified tree into separate trees eliminates the overhead
and scalability issue explained above, but comes at the expense of using
additional memory. In an effort to minimize this overhead, a conditional
allocation is performed (also sketched below). struct cgroup_rstat_cpu,
which originally contained both the rstat list pointers and the base stat
entities, was renamed to cgroup_rstat_base_cpu and is only allocated when
the associated css is cgroup::self. A new compact struct containing only
the rstat list pointers was added and is allocated when the css is
associated with an actual subsystem. With this conditional allocation,
the change in per-cpu memory overhead before/after is shown below.

before:
  sizeof(struct cgroup_rstat_cpu) =~ 176 bytes /* can vary based on config */

  nr_cgroups * sizeof(struct cgroup_rstat_cpu)
  nr_cgroups * 176 bytes

after:
  sizeof(struct cgroup_rstat_cpu) == 16 bytes
  sizeof(struct cgroup_rstat_base_cpu) =~ 176 bytes

  nr_cgroups * (sizeof(struct cgroup_rstat_base_cpu) +
                sizeof(struct cgroup_rstat_cpu) * nr_rstat_controllers)
  nr_cgroups * (176 + 16 * nr_rstat_controllers)

... where nr_rstat_controllers is the number of enabled cgroup
controllers that implement css_rstat_flush(). On a host where both memory
and io are enabled:

  nr_cgroups * (176 + 16 * 2)
  nr_cgroups * 208 bytes
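For illustration, here is a minimal sketch of how per-subsystem lock
selection could look. The lock variables and the ss_rstat_lock() helper
are hypothetical names for this sketch, not necessarily the series' exact
code:

  /* one updater lock for the base stats plus one per subsystem;
   * names are illustrative only */
  static spinlock_t rstat_base_lock;
  static spinlock_t rstat_subsys_lock[CGROUP_SUBSYS_COUNT];

  static spinlock_t *ss_rstat_lock(struct cgroup_subsys *ss)
  {
          /* a NULL ss denotes cgroup::self, i.e. the base stats */
          return ss ? &rstat_subsys_lock[ss->id] : &rstat_base_lock;
  }

With one lock per subsystem, a memory.stat flush contends only with other
memory flushers and never with a concurrent cpu.stat reader.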
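Likewise, a hedged sketch of the conditional allocation described above,
assuming the compact struct is embedded as the first member of the base
struct; css_is_self(), the rstat_cpu member, and the shape of the init
helper are assumptions for illustration:

  /* compact node: only the per-cpu tree linkage (16 bytes on 64-bit) */
  struct cgroup_rstat_cpu {
          struct cgroup_subsys_state *updated_children;
          struct cgroup_subsys_state *updated_next;
  };

  /* full node: linkage first, then the base stat entities;
   * allocated only for cgroup::self */
  struct cgroup_rstat_base_cpu {
          struct cgroup_rstat_cpu rstatc;
          /* ... cputime and other base stat counters ... */
  };

  /* hypothetical init helper: pick the allocation size by css type */
  static int css_rstat_init(struct cgroup_subsys_state *css)
  {
          if (css_is_self(css)) {
                  /* base stats ride along with the list pointers */
                  struct cgroup_rstat_base_cpu __percpu *base;

                  base = alloc_percpu(struct cgroup_rstat_base_cpu);
                  if (!base)
                          return -ENOMEM;
                  css->rstat_cpu = &base->rstatc;
          } else {
                  /* a subsystem css needs only the compact linkage */
                  css->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
                  if (!css->rstat_cpu)
                          return -ENOMEM;
          }
          return 0;
  }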
With regard to validation, there is a measurable benefit when reading
stats with this series. A test program was made to loop 1M times while
reading all four of the files cgroup.stat, cpu.stat, io.stat, memory.stat
of a given parent cgroup each iteration. This test program was used in
the experiments that follow.

The first experiment consisted of a parent cgroup with memory.swap.max=0
and memory.max=1G. On a 52-cpu machine, 26 child cgroups were created,
and within each child cgroup a process was spawned to frequently update
the memory cgroup stats by creating and then reading a file of size 1T
(encouraging reclaim). The test program was run alongside these 26 tasks
in parallel. The results showed a benefit in both the elapsed time and
the perf data of the test program.

time before:
real    0m44.612s
user    0m0.567s
sys     0m43.887s

perf before:
27.02% mem_cgroup_css_rstat_flush
 6.35% __blkcg_rstat_flush
 0.06% cgroup_base_stat_cputime_show

time after:
real    0m27.125s
user    0m0.544s
sys     0m26.491s

perf after:
 6.03% mem_cgroup_css_rstat_flush
 0.37% blkcg_print_stat
 0.11% cgroup_base_stat_cputime_show

Another experiment was set up on the same host using a parent cgroup with
two child cgroups. The same swap and memory max values were used as in
the previous experiment. In the two child cgroups, kernel builds were
done in parallel, each using "-j 20". The perf comparison of the test
program was very similar to the values in the previous experiment. The
time comparison is shown below.

before:
real    1m2.077s
user    0m0.784s
sys     1m0.895s

after:
real    0m32.216s
user    0m0.709s
sys     0m31.256s

Note that the above two experiments were also done with a modified test
program that only reads the cpu.stat file and none of the other three
files previously mentioned. The results were similar to what was seen in
the v1 email for this series. See the changelog for a link to v1 if
needed.

For the final experiment, perf events were recorded during a kernel build
with the same host and cgroup setup. The builds took place in the child
cgroups. The control and experimental sides both showed similar cycle
counts spent in cgroup_rstat_updated(), which appeared insignificant
compared with the other events recorded during the workload.

changelog
v2:
  drop the patch creating a new cgroup_rstat struct and related code
  drop bpf-specific patches. instead just use cgroup::self in bpf progs
  drop the cpu lock patches. instead select cpu lock in updated_list func
  relocate the cgroup_rstat_init() call to inside css_create()
  relocate the cgroup_rstat_exit() cleanup from apply_control_enable()
    to css_free_rwork_fn()
v1: https://lore.kernel.org/all/20250218031448.46951-1-inwardvessel@xxxxxxxxx/

JP Kobryn (4):
  cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state
  cgroup: rstat lock indirection
  cgroup: separate rstat locks for subsystems
  cgroup: separate rstat list pointers from base stats

 block/blk-cgroup.c                            |   4 +-
 include/linux/cgroup-defs.h                   |  67 ++--
 include/linux/cgroup.h                        |   8 +-
 kernel/cgroup/cgroup-internal.h               |   4 +-
 kernel/cgroup/cgroup.c                        |  53 +--
 kernel/cgroup/rstat.c                         | 318 +++++++++++-------
 mm/memcontrol.c                               |   4 +-
 .../selftests/bpf/progs/btf_type_tag_percpu.c |   5 +-
 .../bpf/progs/cgroup_hierarchical_stats.c     |   8 +-
 9 files changed, 276 insertions(+), 195 deletions(-)

-- 
2.43.5