On 2/20/25 7:51 AM, Tejun Heo wrote:
Hello,
On Mon, Feb 17, 2025 at 07:14:37PM -0800, JP Kobryn wrote:
...
The first experiment consisted of a parent cgroup with memory.swap.max=0
and memory.max=1G. On a 52-cpu machine, 26 child cgroups were created, and
within each child cgroup a process was spawned to drive memory cgroup stat
updates by creating and then reading a file of size 1T (encouraging
reclaim). These 26 tasks were run in parallel. While this was going on, a
custom program was used to open the cpu.stat file of the parent cgroup,
read the entire file 1M times, then close it (a sketch of such a reader
follows the timing results below). The perf report for the task performing
the reading showed that on the control side most of the cycles (42%) were
spent in mem_cgroup_css_rstat_flush(). It also showed a smaller but
significant number of cycles spent in __blkcg_rstat_flush(). The perf
report for the patched kernel differed in that no cycles were spent in
these functions; instead, most cycles were spent in
cgroup_base_stat_flush(). Aside from the perf reports, the time spent
running the program reading cpu.stat showed a gain when comparing the
control kernel to the experimental kernel: the time spent in kernel mode
was reduced.
before:
real 0m18.449s
user 0m0.209s
sys 0m18.165s
after:
real 0m6.080s
user 0m0.170s
sys 0m5.890s
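For reference, a minimal sketch of such a reader program is shown below.
This is not the actual program used in the experiments; the cgroup mount
path, the rewind-and-reread pattern, the buffer size, and the error
handling are assumptions.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Assumed path to the parent cgroup's cpu.stat file. */
	const char *path = "/sys/fs/cgroup/parent/cpu.stat";
	char buf[4096];
	ssize_t ret;
	long i;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	for (i = 0; i < 1000000; i++) {
		/* Drain the file; generating cpu.stat output triggers an
		 * rstat flush for the cgroup. */
		do {
			ret = read(fd, buf, sizeof(buf));
		} while (ret > 0);
		if (ret < 0) {
			perror("read");
			break;
		}
		/* Rewind so the next iteration regenerates and re-reads
		 * the file from the start. */
		if (lseek(fd, 0, SEEK_SET) < 0) {
			perror("lseek");
			break;
		}
	}

	close(fd);
	return 0;
}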
Another experiment on the same host was set up using a parent cgroup with
two child cgroups. The same swap and memory limits were used as in the
previous experiment. In the two child cgroups, kernel builds were done in
parallel, each using "-j 20". The program from the previous experiment was
used to perform 1M reads of the parent cpu.stat file. The perf comparison
showed results similar to the previous experiment: on the control side, a
majority of cycles (42%) were spent in mem_cgroup_css_rstat_flush(), with
significant cycles in __blkcg_rstat_flush(); on the experimental side,
most cycles were spent in cgroup_base_stat_flush() and no cycles were
spent flushing memory or io. As for the time taken by the program reading
cpu.stat, measurements are shown below.
before:
real 0m17.223s
user 0m0.259s
sys 0m16.871s
after:
real 0m6.498s
user 0m0.237s
sys 0m6.220s
For the final experiment, perf events were recorded during a kernel build
with the same host and cgroup setup. The builds took place in the child
node. The control and experimental sides showed similar cycle counts for
cgroup_rstat_updated(), and in both cases those cycles appeared
insignificant compared to the other events recorded with the workload.
One of the reasons why the original design used one rstat tree is because
readers, in addition to writers, can often be correlated too - e.g. you'd
often have periodic monitoring tools which poll all the major stat files
periodically. Splitting the trees will likely make those at least a bit
worse. Can you test how much worse that'd be? i.e. repeat the above tests
but read all the major stat files - cgroup.stat, cpu.stat, memory.stat and
io.stat.
Sure. I changed the experiment to read all of these files. It still showed
an improvement in performance. You can see the details in v2 [0], which I
sent out earlier today. A rough sketch of such a multi-file reader is
included after the link below.
[0]
https://lore.kernel.org/all/20250227215543.49928-1-inwardvessel@xxxxxxxxx/
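For reference, a rough sketch of a reader covering all four files is shown
below. This is not the actual v2 program; the cgroup path, the iteration
count, and the open/read/close-per-iteration pattern are assumptions.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Assumed paths to the parent cgroup's stat files. */
static const char * const files[] = {
	"/sys/fs/cgroup/parent/cgroup.stat",
	"/sys/fs/cgroup/parent/cpu.stat",
	"/sys/fs/cgroup/parent/memory.stat",
	"/sys/fs/cgroup/parent/io.stat",
};

int main(void)
{
	char buf[8192];
	ssize_t ret;
	long i;
	int j, fd;

	for (i = 0; i < 1000000; i++) {
		for (j = 0; j < 4; j++) {
			fd = open(files[j], O_RDONLY);
			if (fd < 0) {
				perror("open");
				return 1;
			}
			/* Drain the whole file so the kernel generates the
			 * full stat output for each file. */
			do {
				ret = read(fd, buf, sizeof(buf));
			} while (ret > 0);
			close(fd);
		}
	}
	return 0;
}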
Thanks.