On 3/3/25 7:19 AM, Michal Koutný wrote:
Hello JP.
On Thu, Feb 27, 2025 at 01:55:39PM -0800, inwardvessel <inwardvessel@xxxxxxxxx> wrote:
From: JP Kobryn <inwardvessel@xxxxxxxxx>
The current design of rstat takes the approach that if one subsystem is
to be flushed, all other subsystems with pending updates should also be
flushed. It seems that over time, the stat-keeping of some subsystems
has grown in size to the extent that it noticeably slows down others.
This has been most observable in situations where the memory
controller is enabled. One big area where the issue comes up is system
telemetry, where programs periodically sample cpu stats. It would be a
benefit for programs like this if the overhead of having to flush memory
stats (and others) could be eliminated. It would save cpu cycles for
existing cpu-based telemetry programs and improve scalability in terms
of sampling frequency and volume of hosts.
This series changes the approach of "flush all subsystems" to "flush
only the requested subsystem".
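To make the change in scope concrete, a rough sketch of the difference
follows. The "before" loop is an abridged, illustrative rendering of the
existing per-cpu flush walk (base stat and bpf flushing omitted), not the
series' actual code, and the function names are made up for illustration:

/* before (abridged): a flush on behalf of any reader walks every
 * subsystem that registered an rstat flush callback */
static void flush_all_sketch(struct cgroup *cgrp, int cpu)
{
	struct cgroup_subsys_state *css;

	list_for_each_entry(css, &cgrp->rstat_css_list, rstat_css_node)
		css->ss->css_rstat_flush(css, cpu);	/* memory, io, ... */
}

/* after (illustrative): a reader only triggers the flush callback of
 * the subsystem whose stat file it is actually reading */
static void flush_one_sketch(struct cgroup_subsys_state *css, int cpu)
{
	css->ss->css_rstat_flush(css, cpu);
}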
...
before:
sizeof(struct cgroup_rstat_cpu) =~ 176 bytes /* can vary based on config */
nr_cgroups * sizeof(struct cgroup_rstat_cpu)
nr_cgroups * 176 bytes
after:
...
nr_cgroups * (176 + 16 * 2)
nr_cgroups * 208 bytes
~ 32B/cgroup/cpu
Thanks. I'll make this clear in the cover letter next rev.
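For reference, a small userspace check of that arithmetic, assuming the
extra cost is two pointer-sized links per additional per-subsystem tree
(memory and io); the struct and field names below are illustrative, not
taken from the series:

#include <stdio.h>

/* Illustrative per-subsystem per-cpu linkage: two pointers, in the
 * style of an updated_children/updated_next pair. Hypothetical names. */
struct subsys_rstat_cpu_sketch {
	void *updated_children;	/* first descendant with pending updates */
	void *updated_next;	/* next sibling on the parent's update list */
};

int main(void)
{
	size_t per_tree = sizeof(struct subsys_rstat_cpu_sketch);	/* 16 on 64-bit */
	size_t before = 176;	/* sizeof(struct cgroup_rstat_cpu), config-dependent */
	size_t after = before + 2 * per_tree;	/* two extra trees: memory, io */

	printf("per tree: %zu, before: %zu, after: %zu bytes/cgroup/cpu\n",
	       per_tree, before, after);
	return 0;
}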
With regard to validation, there is a measurable benefit when reading
stats with this series. A test program was made that loops 1M times,
reading all four of the files cgroup.stat, cpu.stat, io.stat, and
memory.stat of a given parent cgroup on each iteration. This test program
has been run in the experiments that follow.
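For reference, the reader's shape is roughly the following (an assumed
reconstruction, not the actual test program; the cgroup path is
hypothetical):

#include <fcntl.h>
#include <unistd.h>

/* Read one stat file to EOF and close it. */
static void read_file(const char *path)
{
	char buf[64 * 1024];
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return;
	while (read(fd, buf, sizeof(buf)) > 0)
		;
	close(fd);
}

int main(void)
{
	static const char * const files[] = {
		"/sys/fs/cgroup/parent/cgroup.stat",	/* hypothetical path */
		"/sys/fs/cgroup/parent/cpu.stat",
		"/sys/fs/cgroup/parent/io.stat",
		"/sys/fs/cgroup/parent/memory.stat",
	};

	/* re-read all four stat files of one parent cgroup, 1M iterations */
	for (int i = 0; i < 1000000; i++)
		for (unsigned int f = 0; f < 4; f++)
			read_file(files[f]);
	return 0;
}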
Thanks for looking into this and running experiments on the behavior of
split rstat trees.
And thank you for reviewing along with the good questions.
The first experiment consisted of a parent cgroup with memory.swap.max=0
and memory.max=1G. On a 52-cpu machine, 26 child cgroups were created
and within each child cgroup a process was spawned to frequently update
the memory cgroup stats by creating and then reading a file of size 1T
(encouraging reclaim). The test program was run alongside these 26 tasks
in parallel. The results showed a benefit in both time elapsed and perf
data of the test program.
time before:
real 0m44.612s
user 0m0.567s
sys 0m43.887s
perf before:
27.02% mem_cgroup_css_rstat_flush
6.35% __blkcg_rstat_flush
0.06% cgroup_base_stat_cputime_show
time after:
real 0m27.125s
user 0m0.544s
sys 0m26.491s
So this shows that flushing rstat trees one by one (as the test program
reads *.stat) is quicker than flushing all at once (+idle reads of
*.stat).
Interesting, I'd not bet on that at first but that is convincing to
favor the separate trees approach.
perf after:
6.03% mem_cgroup_css_rstat_flush
0.37% blkcg_print_stat
0.11% cgroup_base_stat_cputime_show
I'd understand why the series reduces time spent in
mem_cgroup_flush_stats() but what does the lower proportion of
mem_cgroup_css_rstat_flush() show?
When the flush is initiated by reading the file memory.stat,
memory_stat_show() is called, which leads to __mem_cgroup_flush_stats().
In this function, there is an early return when (!force && !needs_flush)
is true. This opportunity to "skip" a flush is not reached when another
subsystem has initiated the flush and the entry point for flushing memory
is css->css_rstat_flush().
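In other words, the two entry points differ roughly as below (a
simplified paraphrase of the memcontrol flush path, not verbatim kernel
code; the function name is made up for illustration):

/* Entry point when memory.stat is read: the flush may be skipped. */
static void memory_stat_read_flush_sketch(struct mem_cgroup *memcg, bool force)
{
	bool needs_flush = memcg_vmstats_needs_flush(memcg->vmstats);

	if (!force && !needs_flush)	/* cheap skip, no per-cpu walk */
		return;

	cgroup_rstat_flush(memcg->css.cgroup);
}

/*
 * Entry point when another subsystem's reader drives the flush: the
 * rstat walk invokes css->ss->css_rstat_flush() for memcg on each cpu
 * directly, so the skip above is never consulted and the memcg per-cpu
 * work is done regardless.
 */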
To verify the above, I made use of a tracepoint previously added [0] to
get info on the number of memcg flushes performed vs skipped. In a
comparison between reading only the memory.stat file vs reading the
{memory,io,cpu}.stat files under the same test, the flush count
increased by about the same amount that the skip count decreased.
Reading memory.stat
non-forced flushes: 5781
flushes skipped: 995826
Reading {memory,io,cpu}.stat
non-forced flushes: 12047
flushes skipped: 990857
If the flushes were not skipped, I think we would see a similar
proportion of mem_cgroup_css_rstat_flush() when reading memory.stat.
[0]
https://lore.kernel.org/all/20241029021106.25587-1-inwardvessel@xxxxxxxxx/
Another experiment was setup on the same host using a parent cgroup with
two child cgroups. The same swap and memory max were used as the
previous experiment. In the two child cgroups, kernel builds were done
in parallel, each using "-j 20". The perf comparison of the test program
was very similar to the values in the previous experiment. The time
comparison is shown below.
before:
real 1m2.077s
user 0m0.784s
sys 1m0.895s
This is 1M loops of the stats reading program like before? I.e. if this
should be analogous to the 0m44.612s above, why isn't it the same? (I'm
thinking of more frequent updates in the latter test.)
Yes. One notable difference in this test is that there are more threads
in the workload (40 vs 26) doing the updates.
after:
real 0m32.216s
user 0m0.709s
sys 0m31.256s
What was impact on the kernel build workloads (cgroup_rstat_updated)?
You can now find some workload timing results further down. If you're
asking specifically about time spent in cgroup_rstat_updated(), perf
reports show only fractional percentages on both sides.
(Perhaps the saved 30s of CPU work (if it is effectively moved from
readers to writers) would be spread too thin across the two 20-way
parallel kernel builds, right?)
Are you suggesting a workload with fewer threads?
...
For the final experiment, perf events were recorded during a kernel
build with the same host and cgroup setup. The builds took place in the
child node. Control and experimental sides both showed similar cycles
spent in cgroup_rstat_updated(), and it appeared insignificant compared
to the other events recorded with the workload.
What's the change between control vs experiment? Running in root cg vs
nested? Or running without *.stat readers vs with them against the
kernel build?
(This clarification would likely answer my question above.)
workload control with no readers:
real 6m54.818s
user 117m3.122s
sys 5m4.996s
workload experiment with no readers:
real 6m54.862s
user 117m12.812s
sys 5m0.943s
workload control with constant readers {memory,io,cpu,cgroup}.stat:
real 6m59.468s
user 118m26.981s
sys 5m20.163s
workload experiment with constant readers {memory,io,cpu,cgroup}.stat:
real 6m57.031s
user 118m13.833s
sys 5m3.454s
These tests were done in a child (nested) cgroup. Were you also asking
for a root vs nested experiment, or did you just need clarification on
the test details?
Michal