On 2023/8/3 Tejun Heo wrote:
> I couldn't come up with an answer. Let's go ahead with adding the field but
> can you please do the followings?
Thank you for your suggestions. I am happy to do that.
> * Name it to something like subtree_bstat instead of cumul_bstat. The
>   counters are all cumulative.
I did this in the v2 patch.
> * Are you sure the upward propagation logic is correct? It's calculating
>   global delta and then propagating to the per-cpu delta of the parent. Is
>   that correct because the two delta calculations always end up the same?
Sorry, I made a mistake and misled you. These two deltas are not always
equal.
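To make the trace below easier to follow, here is a simplified sketch of
the flush path as it behaves with the v1 patch applied (paraphrased from
cgroup_base_stat_flush() in kernel/cgroup/rstat.c; locking and the
u64_stats read loop are omitted, and the cumul_bstat lines are shorthand
for what v1 does rather than the exact diff):

static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
{
	struct cgroup *parent = cgroup_parent(cgrp);
	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
	struct cgroup_base_stat delta;

	/* root-level stats are sourced from system-wide CPU statistics */
	if (!parent)
		return;

	/* per-cpu delta: what this cpu accumulated since its last flush */
	delta = rstatc->bstat;
	cgroup_base_stat_sub(&delta, &rstatc->last_bstat);
	cgroup_base_stat_add(&cgrp->bstat, &delta);
	cgroup_base_stat_add(&rstatc->last_bstat, &delta);
	/* v1: the per-cpu delta also goes into this cgroup's per-cpu counter */
	cgroup_base_stat_add(&rstatc->cumul_bstat, &delta);

	/* propagate global delta to the parent (unless the parent is root) */
	if (cgroup_parent(parent)) {
		/*
		 * global delta: everything cgrp->bstat gained since the last
		 * propagation, which may have come from several cpus
		 */
		delta = cgrp->bstat;
		cgroup_base_stat_sub(&delta, &cgrp->last_bstat);
		cgroup_base_stat_add(&parent->bstat, &delta);
		cgroup_base_stat_add(&cgrp->last_bstat, &delta);
		/*
		 * v1: the *global* delta is added to the parent's per-cpu
		 * counter for this cpu -- this is where the error comes from
		 */
		cgroup_base_stat_add(&cgroup_rstat_cpu(parent, cpu)->cumul_bstat,
				     &delta);
	}
}

The per-cpu delta and the global delta are only equal as long as every
global delta happens to come from a single cpu.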
I found and reproduced a bug case:
We create /sys/fs/cgroup/test, /sys/fs/cgroup/test/t1, and
/sys/fs/cgroup/test/t1/tt1 in turn, and only /sys/fs/cgroup/test/t1/tt1
contains tasks. After applying this patch (v1), the operations and the
corresponding data changes are as follows:
Step 1. cat /sys/fs/cgroup/test/t1/tt1/cpu.stat
*cpu 6* current flush cgroup /test/t1/tt1
per-cpu delta utime 0 stime 0 sum_exec_runtime 257341
/test/t1/tt1 cumul_bstat(cpu 6) utime 0 stime 0 sum_exec_runtime 257341
/test/t1/tt1 cgrp->bstat utime 0 stime 0 sum_exec_runtime 257341
if (cgroup_parent(parent)) {
cgrp delta utime 0 stime 0 sum_exec_runtime 257341
parent(/test/t1) ->bstat utime 0 stime 0 sum_exec_runtime 257341
parent(/test/t1) last_bstat utime 0 stime 0 sum_exec_runtime 0
parent(/test/t1) cumul_bstat utime 0 stime 0 sum_exec_runtime 257341
}
Step 2. cat /sys/fs/cgroup/test/t1/tt1/cpu.stat
*cpu 12* current flush cgroup /test/t1/tt1
per-cpu delta utime 0 stime 1000000 sum_exec_runtime 747042
/test/t1/tt1 cumul_bstat utime 0 stime 1000000 sum_exec_runtime 747042
/test/t1/tt1 cgrp->bstat utime 0 stime 1000000 sum_exec_runtime 1004383
if (cgroup_parent(parent)) {
cgrp delta utime 0 stime 1000000 sum_exec_runtime 747042
parent(/test/t1) ->bstat utime 0 stime 1000000 sum_exec_runtime 1004383
parent(/test/t1) last_bstat utime 0 stime 0 sum_exec_runtime 0
parent(/test/t1) cumul_bstat utime 0 stime 1000000 sum_exec_runtime 747042
}
Step 3. cat /sys/fs/cgroup/test/cpu.stat
(cgroup flush /test/t1/tt1 -> /test/t1 -> /test in turn)
*cpu 6* current flush cgroup /test/t1/tt1
per-cpu delta utime 0 stime 0 sum_exec_runtime 263468
/test/t1/tt1 cumul_bstat(cpu 6) utime 0 stime 0 sum_exec_runtime 520809
/test/t1/tt1 cgrp->bstat utime 0 stime 1000000 sum_exec_runtime 1267851
if (cgroup_parent(parent)) {
cgrp delta utime 0 stime 0 sum_exec_runtime 263468
parent(/test/t1) ->bstat utime 0 stime 1000000 sum_exec_runtime 1267851
parent(/test/t1) last_bstat utime 0 stime 0 sum_exec_runtime 0
parent(/test/t1) cumul_bstat utime 0 stime 0 sum_exec_runtime 520809
}
*cpu 6* current flush cgroup /test/t1
per-cpu delta utime 0 stime 0 sum_exec_runtime 0
/test/t1 cumul_bstat(cpu 6) utime 0 stime 0 sum_exec_runtime 520809
/test/t1 cgrp->bstat utime 0 stime 1000000 sum_exec_runtime 1267851
if (cgroup_parent(parent)) {
cgrp delta utime 0 stime 1000000 sum_exec_runtime 1267851 <---
parent(/test) ->bstat utime 0 stime 1000000 sum_exec_runtime 1267851
parent(/test) last_bstat utime 0 stime 0 sum_exec_runtime 0
parent(/test) cumul_bstat(cpu 6) utime 0 stime 1000000 sum_exec_runtime 1267851 <--- *error*
******
Here the cgrp delta is *not equal* to the per-cpu delta.
The cgroup (/test) and its child cgroup (/test/t1/tt1) are not flushed
at the same rate: a call to cgroup_base_stat_flush() does not
necessarily propagate all the way up to the highest-level non-root
cgroup (in steps 1 and 2 above only /test/t1/tt1 was flushed, so
/test/t1's last_bstat was never updated).
Therefore the cgrp delta may contain the accumulated values of several
per-cpu deltas from different cpus.
The correct value of parent(/test) cumul_bstat(cpu 6) would be utime 0
stime 0 sum_exec_runtime 520809.
******
}
*cpu 6* current flush cgroup /test
per-cpu delta utime 0 stime 0 sum_exec_runtime 0
/test cumul_bstat(cpu 6) utime 0 stime 1000000 sum_exec_runtime 1267851
/test ->bstat utime 0 stime 1000000 sum_exec_runtime 1267851
cgroup_parent(parent) is NULL, so the flush ends here.
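Putting the cpu 6 numbers from the three flushes of /test/t1/tt1 together
shows the problem:

  cpu 6:  257341 (step 1) + 263468 (step 3) = 520809   <- cpu 6's real share
  cpu 12: 747042 (step 2)
  total:  520809 + 747042 = 1267851                    <- charged to cpu 6

The global delta of /test/t1 (1267851) also contains cpu 12's 747042, but
with the v1 patch it is added entirely to cpu 6's cumul_bstat of /test.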
So we should add a per-cpu variable subtree_last_bstat, similar to
cgrp->last_bstat, to record the value last propagated to the parent's
per-cpu counter, roughly as sketched below.
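A minimal sketch of the idea (simplified, locking omitted; the exact
change is in the v2 patch linked below, which also renames cumul_bstat
to subtree_bstat):

	/* inside cgroup_base_stat_flush(), after cgrp->bstat has been updated */
	if (cgroup_parent(parent)) {
		struct cgroup_rstat_cpu *prstatc = cgroup_rstat_cpu(parent, cpu);

		/* global delta -> parent->bstat, unchanged */
		delta = cgrp->bstat;
		cgroup_base_stat_sub(&delta, &cgrp->last_bstat);
		cgroup_base_stat_add(&parent->bstat, &delta);
		cgroup_base_stat_add(&cgrp->last_bstat, &delta);

		/*
		 * per-cpu subtree delta: diff this cpu's subtree_bstat against
		 * its own last value instead of reusing the global delta
		 */
		delta = rstatc->subtree_bstat;
		cgroup_base_stat_sub(&delta, &rstatc->subtree_last_bstat);
		cgroup_base_stat_add(&prstatc->subtree_bstat, &delta);
		cgroup_base_stat_add(&rstatc->subtree_last_bstat, &delta);
	}

This way the parent's per-cpu counter only ever receives deltas that were
actually accumulated on that cpu.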
I have sent the v2 patch, please review it again.
v2 link:
https://lore.kernel.org/all/20230807032930.87785-1-jiahao.os@xxxxxxxxxxxxx
> * Please add a comment explaining that the field is not currently used
>   outside of being read from bpf / drgn and what not and that we're still
>   trying to determine how to expose that in the cgroupfs interface.
Thanks, I did this in the v2 patch.
Thanks,
Hao