On Thu, Feb 27, 2025 at 01:55:39PM -0800, inwardvessel wrote:
> From: JP Kobryn <inwardvessel@xxxxxxxxx>
>
> The current design of rstat takes the approach that if one subsystem is
> to be flushed, all other subsystems with pending updates should also be
> flushed. It seems that over time, the stat-keeping of some subsystems
> has grown in size to the extent that they are noticeably slowing down
> others. This has been most observable in situations where the memory
> controller is enabled. One big area where the issue comes up is system
> telemetry, where programs periodically sample cpu stats. It would be a
> benefit for programs like this if the overhead of having to flush memory
> stats (and others) could be eliminated. It would save cpu cycles for
> existing cpu-based telemetry programs and improve scalability in terms
> of sampling frequency and volume of hosts.
>
> This series changes the approach of "flush all subsystems" to "flush
> only the requested subsystem". The core design change is moving from a
> single unified rstat tree of cgroups to having separate trees made up of
> cgroup_subsys_state's. There will be one (per-cpu) tree for the base
> stats (cgroup::self) and one for each enabled subsystem (if it
> implements css_rstat_flush()). In order to do this, the rstat list
> pointers were moved off of the cgroup and onto the css. In the
> transition, these list pointer types were changed to
> cgroup_subsys_state. This allows rstat trees to now be made up of
> css nodes, where a given tree will only contain css nodes associated
> with a specific subsystem. The rstat APIs were changed to accept a
> reference to a cgroup_subsys_state instead of a cgroup. This allows
> callers to be specific about which stats are being updated/flushed.
> Since separate trees will be in use, the locking scheme was adjusted.
> The global locks were split up in such a way that there are separate
> locks for the base stats (cgroup::self) and each subsystem (memory, io,
> etc). This allows different subsystems (including base stats) to use
> rstat in parallel with no contention.
>
> Breaking up the unified tree into separate trees eliminates the overhead
> and scalability issue explained in the first section, but comes at the
> expense of using additional memory. In an effort to minimize this
> overhead, a conditional allocation is performed. The cgroup_rstat_cpu
> struct originally contained the rstat list pointers and the base stat
> entities. This struct was renamed to cgroup_rstat_base_cpu and is only
> allocated when the associated css is cgroup::self. A new compact struct
> was added that only contains the rstat list pointers. When the css is
> associated with an actual subsystem, this compact struct is allocated.
> With this conditional allocation, the change in memory overhead on a
> per-cpu basis before/after is shown below.
>
> before:
> sizeof(struct cgroup_rstat_cpu) =~ 176 bytes /* can vary based on config */
>
> nr_cgroups * sizeof(struct cgroup_rstat_cpu)
> nr_cgroups * 176 bytes
>
> after:
> sizeof(struct cgroup_rstat_cpu) == 16 bytes
> sizeof(struct cgroup_rstat_base_cpu) =~ 176 bytes
>
> nr_cgroups * (
>     sizeof(struct cgroup_rstat_base_cpu) +
>     sizeof(struct cgroup_rstat_cpu) * nr_rstat_controllers
> )
>
> nr_cgroups * (176 + 16 * nr_rstat_controllers)
>
> ... where nr_rstat_controllers is the number of enabled cgroup
> controllers that implement css_rstat_flush(). On a host where both
> memory and io are enabled:
>
> nr_cgroups * (176 + 16 * 2)
> nr_cgroups * 208 bytes
>
> With regard to validation, there is a measurable benefit when reading
> stats with this series. A test program was made to loop 1M times while
> reading all four of the files cgroup.stat, cpu.stat, io.stat, and
> memory.stat of a given parent cgroup each iteration.
> This test program has been run in the experiments that follow.
>
> The first experiment consisted of a parent cgroup with memory.swap.max=0
> and memory.max=1G. On a 52-cpu machine, 26 child cgroups were created,
> and within each child cgroup a process was spawned to frequently update
> the memory cgroup stats by creating and then reading a file of size 1T
> (encouraging reclaim). The test program was run alongside these 26 tasks
> in parallel. The results showed a benefit in both time elapsed and perf
> data of the test program.
>
> time before:
> real 0m44.612s
> user 0m0.567s
> sys 0m43.887s
>
> perf before:
> 27.02% mem_cgroup_css_rstat_flush
> 6.35% __blkcg_rstat_flush
> 0.06% cgroup_base_stat_cputime_show
>
> time after:
> real 0m27.125s
> user 0m0.544s
> sys 0m26.491s
>
> perf after:
> 6.03% mem_cgroup_css_rstat_flush
> 0.37% blkcg_print_stat
> 0.11% cgroup_base_stat_cputime_show
>
> Another experiment was set up on the same host using a parent cgroup with
> two child cgroups. The same swap and memory limits were used as in the
> previous experiment. In the two child cgroups, kernel builds were done
> in parallel, each using "-j 20". The perf comparison of the test program
> was very similar to the values in the previous experiment. The time
> comparison is shown below.
>
> before:
> real 1m2.077s
> user 0m0.784s
> sys 1m0.895s
>
> after:
> real 0m32.216s
> user 0m0.709s
> sys 0m31.256s

Great results, and I am glad that the series went down from 11 patches to
4 once we simplified the BPF handling. The added memory overhead doesn't
seem to be concerning (~320KB on a system with 100 cgroups and 100 CPUs).

Nice work.