The current design of rstat takes the approach that if one subsystem is to be flushed, all other subsystem controllers with pending updates should also be flushed. It seems that over time, the stat-keeping of some subsystems has grown in size to the extent that it noticeably slows down the others. This has been most observable in situations where the memory controller is enabled. One big area where the issue comes up is system telemetry, where programs periodically sample cpu stats. Programs like this would benefit if the overhead of flushing memory stats (and others) could be eliminated. It would save cpu cycles for existing cpu-based telemetry programs and improve scalability in terms of sampling frequency and the number of hosts the programs run on - the cpu cycles saved help free up the budget for cycles to be spent on the desired stats.

This series changes the approach from "flush all subsystems" to "flush only the requested subsystem". The core design change is moving from a single unified rstat tree to separate trees: one for each enabled subsystem controller that implements css_rstat_flush(), one for the base stats (cgroup::self) subsystem state, and one dedicated to bpf-based cgroups (if enabled).

A layer of indirection was introduced into rstat. Where a cgroup reference was previously used as a parameter, the updated/flush families of functions now accept a reference to a new cgroup_rstat struct along with a new interface containing ops that perform type-specific pointer offsets for accessing common types. The ops allow the rstat routines to work only with common types while hiding away any unique types. Together, the new struct and interface allow the set of entities that can participate in rstat to be extended. In this series, cgroup_subsys_state and cgroup_bpf become participants. For both of these structs, the cgroup_rstat struct was added as a new field and ops were statically defined for each type in order to provide access to related objects. To illustrate, cgroup_subsys_state was given an rstat_struct field and one of its ops was defined to return a pointer to the rstat struct of its parent.

Public APIs were changed as well. So that clients can be specific about which stats are being updated or flushed, a reference to the given cgroup_subsys_state is now passed instead of the cgroup. For the bpf APIs, a cgroup is still passed as an argument since there is no subsystem state associated with these custom cgroups; however, the names of those API calls were changed and now carry a "bpf_" prefix. A rough sketch of the new types and entry points is shown below.

Since separate trees are in use, the locking scheme was adjusted to prevent contention between them. Separate locks exist for the three categories: base stats (cgroup::self), formal subsystem controllers (memory, io, etc), and bpf-based cgroups. Where applicable, the functions for lock management were adjusted to accept the lock as a parameter instead of using a global.

Breaking up the unified tree into separate trees eliminates the overhead and scalability issue explained in the first section, but comes at the expense of additional memory. In an effort to minimize this overhead, a conditional allocation is performed. The cgroup_rstat_cpu struct originally contained both the rstat list pointers and the base stat entities; it was reduced to contain only the list pointers.
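To make the above more concrete, here is a rough sketch of the kinds of types and entry points described, including the reduced per-cpu struct just mentioned. The field names, op names, and exact signatures below are assumptions drawn from this description (plus standard kernel helpers like container_of()), not necessarily the exact code in the series:

/*
 * Sketch only - names and layout are illustrative.
 */

/* Per-cpu node, now reduced to just the rstat list pointers. */
struct cgroup_rstat_cpu {
	struct cgroup_rstat *updated_children;	/* terminated by self */
	struct cgroup_rstat *updated_next;	/* NULL when not on a list */
};

/* Embedded in cgroup_subsys_state and in cgroup_bpf. */
struct cgroup_rstat {
	struct cgroup_rstat_cpu __percpu *rstat_cpu;
};

/* Type-specific accessors so the rstat core only deals in common types. */
struct cgroup_rstat_ops {
	struct cgroup_rstat *(*parent_fn)(struct cgroup_rstat *rstat);
	struct cgroup *(*cgroup_fn)(struct cgroup_rstat *rstat);
};

/*
 * Example op for the css case: recover the owning css via container_of()
 * and hand back the parent's embedded rstat struct (the field name
 * "rstat" is assumed here).
 */
static struct cgroup_rstat *css_rstat_parent(struct cgroup_rstat *rstat)
{
	struct cgroup_subsys_state *css =
		container_of(rstat, struct cgroup_subsys_state, rstat);

	return css->parent ? &css->parent->rstat : NULL;
}

/*
 * Updated/flush entry points: callers name the subsystem state they care
 * about; the bpf variants still take a cgroup but gain a "bpf_" prefix.
 */
void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu);
void cgroup_rstat_flush(struct cgroup_subsys_state *css);
void bpf_cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
void bpf_cgroup_rstat_flush(struct cgroup *cgrp);

With accessors like these, the core update/flush paths can walk any of the separate trees through cgroup_rstat without needing to know whether the owner is a css or a cgroup_bpf.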
For the single case where base stats participate in rstat, a new struct cgroup_rstat_base_cpu was created; it contains both the list pointers and the base stat entities. The conditional allocation is done only when the cgroup::self subsys_state is initialized. Since the list pointers exist at the beginning of both cgroup_rstat_cpu and cgroup_rstat_base_cpu, a union is used to access one type or the other depending on whether cgroup::self is detected; it is the one subsystem state whose subsystem pointer is NULL.

With this change, the total per-cpu memory overhead is:

	nr_cgroups * (sizeof(struct cgroup_rstat_base_cpu) +
		      sizeof(struct cgroup_rstat_cpu) * nr_controllers)

or, if bpf-based cgroups are enabled:

	nr_cgroups * (sizeof(struct cgroup_rstat_base_cpu) +
		      sizeof(struct cgroup_rstat_cpu) * (nr_controllers + 1))

... where nr_controllers is the number of enabled cgroup controllers that implement css_rstat_flush().

With regard to validation, there is a measurable benefit when reading a specific set of stats. Using the cpu stats as the basis for flushing, some experiments were set up to measure the perf and time differences.

The first experiment consisted of a parent cgroup with memory.swap.max=0 and memory.max=1G. On a 52-cpu machine, 26 child cgroups were created, and within each child cgroup a process was spawned to encourage updates to the memory cgroup stats by creating and then reading a file of size 1T (encouraging reclaim). These 26 tasks ran in parallel. While this was going on, a custom program was used to open the cpu.stat file of the parent cgroup, read the entire file 1M times, then close it (a minimal sketch of such a reader appears after the results below). On the control side, the perf report for the reading task showed that most of the cycles (42%) were spent in mem_cgroup_css_rstat_flush(), with a smaller but still significant number of cycles spent in __blkcg_rstat_flush(). The perf report for the patched kernel showed no cycles spent in either of these functions; instead, most cycles were spent in cgroup_base_stat_flush(). Aside from the perf reports, the time spent by the program reading cpu.stat also showed a gain on the experimental kernel compared to the control - the time in kernel mode was reduced.

before:
real	0m18.449s
user	0m0.209s
sys	0m18.165s

after:
real	0m6.080s
user	0m0.170s
sys	0m5.890s

Another experiment on the same host used a parent cgroup with two child cgroups, with the same swap and memory limits as the previous experiment. In the two child cgroups, kernel builds were run in parallel, each using "-j 20". The program from the previous experiment was again used to perform 1M reads of the parent cpu.stat file. The perf comparison showed results similar to the previous experiment: on the control side, a majority of cycles (42%) were spent in mem_cgroup_css_rstat_flush() and a significant number in __blkcg_rstat_flush(); on the experimental side, most cycles were spent in cgroup_base_stat_flush() and no cycles were spent flushing memory or io. The time taken by the program reading cpu.stat is shown below.

before:
real	0m17.223s
user	0m0.259s
sys	0m16.871s

after:
real	0m6.498s
user	0m0.237s
sys	0m6.220s
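For reference, the cpu.stat reader used above is not included in this series; a minimal stand-in might look like the sketch below. Only the "open once, read the whole file 1M times, close" behavior is taken from the description; the cgroup path is a placeholder.

/* cpu.stat reader sketch - path and buffer size are placeholders */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int fd, i;

	fd = open("/sys/fs/cgroup/parent/cpu.stat", O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	for (i = 0; i < 1000000; i++) {
		/* each pass re-reads the file, triggering an rstat flush */
		if (lseek(fd, 0, SEEK_SET) < 0 ||
		    read(fd, buf, sizeof(buf)) < 0) {
			perror("read");
			break;
		}
	}

	close(fd);
	return 0;
}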
For the final experiment, perf events were recorded during a kernel build with the same host and cgroup setup. The builds took place in the child node. The control and experimental sides both showed a similar number of cycles spent in cgroup_rstat_updated(), and in both cases the count appeared insignificant compared to the other events recorded with the workload.

JP Kobryn (11):
  cgroup: move rstat pointers into struct of their own
  cgroup: add level of indirection for cgroup_rstat struct
  cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state
  cgroup: introduce cgroup_rstat_ops
  cgroup: separate rstat for bpf cgroups
  cgroup: rstat lock indirection
  cgroup: fetch cpu-specific lock in rstat cpu lock helpers
  cgroup: rstat cpu lock indirection
  cgroup: separate rstat locks for bpf cgroups
  cgroup: separate rstat locks for subsystems
  cgroup: separate rstat list pointers from base stats

 block/blk-cgroup.c                            |   4 +-
 include/linux/bpf-cgroup-defs.h               |   3 +
 include/linux/cgroup-defs.h                   |  98 +--
 include/linux/cgroup.h                        |  11 +-
 include/linux/cgroup_rstat.h                  |  97 +++
 kernel/bpf/cgroup.c                           |   6 +
 kernel/cgroup/cgroup-internal.h               |   9 +-
 kernel/cgroup/cgroup.c                        |  65 +-
 kernel/cgroup/rstat.c                         | 556 +++++++++++++-----
 mm/memcontrol.c                               |   4 +-
 .../selftests/bpf/progs/btf_type_tag_percpu.c |   5 +-
 .../bpf/progs/cgroup_hierarchical_stats.c     |   6 +-
 12 files changed, 594 insertions(+), 270 deletions(-)
 create mode 100644 include/linux/cgroup_rstat.h

--
2.43.5