On Thu, Feb 27, 2025 at 01:55:39PM -0800, inwardvessel wrote:
> From: JP Kobryn <inwardvessel@xxxxxxxxx>
>
> The current design of rstat takes the approach that if one subsystem is
> to be flushed, all other subsystems with pending updates should also be
> flushed. It seems that over time, the stat-keeping of some subsystems
> has grown in size to the extent that they are noticeably slowing down
> others. This has been most observable in situations where the memory
> controller is enabled. One big area where the issue comes up is system
> telemetry, where programs periodically sample cpu stats. It would be a
> benefit for programs like this if the overhead of having to flush memory
> stats (and others) could be eliminated. It would save cpu cycles for
> existing cpu-based telemetry programs and improve scalability in terms
> of sampling frequency and volume of hosts.
>
> This series changes the approach of "flush all subsystems" to "flush
> only the requested subsystem". The core design change is moving from a
> single unified rstat tree of cgroups to having separate trees made up of
> cgroup_subsys_state's. There will be one (per-cpu) tree for the base
> stats (cgroup::self) and one for each enabled subsystem (if it
> implements css_rstat_flush()). In order to do this, the rstat list
> pointers were moved off of the cgroup and onto the css. In the
> transition, these list pointer types were changed to
> cgroup_subsys_state. This allows rstat trees to now be made up of
> css nodes, where a given tree will only contain css nodes associated
> with a specific subsystem. The rstat APIs were changed to accept a
> reference to a cgroup_subsys_state instead of a cgroup. This allows
> callers to be specific about which stats are being updated/flushed.
> Since separate trees will be in use, the locking scheme was adjusted.
> The global locks were split up in such a way that there are separate
> locks for the base stats (cgroup::self) and each subsystem (memory, io,
> etc). This allows different subsystems (including base stats) to use
> rstat in parallel with no contention.
>
> Breaking up the unified tree into separate trees eliminates the overhead
> and scalability issue explained in the first section, but comes at the
> expense of using additional memory. In an effort to minimize this
> overhead, a conditional allocation is performed. The cgroup_rstat_cpu
> struct originally contained the rstat list pointers and the base stat
> entities. This struct was renamed to cgroup_rstat_base_cpu and is only
> allocated when the associated css is cgroup::self. A new compact struct
> was added that only contains the rstat list pointers. When the css is
> associated with an actual subsystem, this compact struct is allocated.
> With this conditional allocation, the change in memory overhead on a
> per-cpu basis before/after is shown below.
>
> before:
> sizeof(struct cgroup_rstat_cpu) =~ 176 bytes /* can vary based on config */
>
> nr_cgroups * sizeof(struct cgroup_rstat_cpu)
> nr_cgroups * 176 bytes
>
> after:
> sizeof(struct cgroup_rstat_cpu) == 16 bytes
> sizeof(struct cgroup_rstat_base_cpu) =~ 176 bytes
>
> nr_cgroups * (
>     sizeof(struct cgroup_rstat_base_cpu) +
>     sizeof(struct cgroup_rstat_cpu) * nr_rstat_controllers
> )
>
> nr_cgroups * (176 + 16 * nr_rstat_controllers)
>
> ... where nr_rstat_controllers is the number of enabled cgroup
> controllers that implement css_rstat_flush(). On a host where both
> memory and io are enabled:
>
> nr_cgroups * (176 + 16 * 2)
> nr_cgroups * 208 bytes
>
> With regard to validation, there is a measurable benefit when reading
> stats with this series. A test program was made to loop 1M times while
> reading all four of the files cgroup.stat, cpu.stat, io.stat, and
> memory.stat of a given parent cgroup each iteration.
> This test program has been run in the experiments that follow.
>
> The first experiment consisted of a parent cgroup with memory.swap.max=0
> and memory.max=1G. On a 52-cpu machine, 26 child cgroups were created,
> and within each child cgroup a process was spawned to frequently update
> the memory cgroup stats by creating and then reading a file of size 1T
> (encouraging reclaim). The test program was run alongside these 26 tasks
> in parallel. The results showed a benefit in both time elapsed and perf
> data of the test program.
>
> time before:
> real 0m44.612s
> user 0m0.567s
> sys 0m43.887s
>
> perf before:
> 27.02% mem_cgroup_css_rstat_flush
> 6.35% __blkcg_rstat_flush
> 0.06% cgroup_base_stat_cputime_show
>
> time after:
> real 0m27.125s
> user 0m0.544s
> sys 0m26.491s
>
> perf after:
> 6.03% mem_cgroup_css_rstat_flush
> 0.37% blkcg_print_stat
> 0.11% cgroup_base_stat_cputime_show
>
> Another experiment was set up on the same host using a parent cgroup with
> two child cgroups. The same swap and memory limits were used as in the
> previous experiment. In the two child cgroups, kernel builds were done
> in parallel, each using "-j 20". The perf comparison of the test program
> was very similar to the values in the previous experiment. The time
> comparison is shown below.
>
> before:
> real 1m2.077s
> user 0m0.784s
> sys 1m0.895s
>
> after:
> real 0m32.216s
> user 0m0.709s
> sys 0m31.256s

Great results, and I am glad that the series went down from 11 patches to
4 once we simplified the BPF handling. The added memory overhead doesn't
seem to be concerning (~320KB on a system with 100 cgroups and 100 CPUs).

Nice work.