On 2/10/25 11:38 PM, Michal Koutný wrote:
Hello Abel (sorry for my delay).
On Wed, Jan 29, 2025 at 12:48:09PM +0800, Abel Wu <wuyun.abel@xxxxxxxxxxxxx> wrote:
PSI tracks stall times for each cpu, and
tSOME[cpu] = time(nr_delayed_tasks[cpu] != 0)
which turns nr_delayed_tasks[cpu] into a boolean value and hence loses
insight into how severely this task group is stalled on this cpu.
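
To illustrate the information loss described above, here is a minimal
sketch (not kernel code). It assumes a hypothetical per-tick sample of
nr_delayed_tasks on one cpu; the helper names and the sample values are
made up for illustration only.

```python
def some_time(nr_delayed_per_tick, tick=1.0):
    """Boolean view: time during which at least one task was delayed."""
    return sum(tick for nr in nr_delayed_per_tick if nr != 0)


def total_delay(nr_delayed_per_tick, tick=1.0):
    """Summed view: delay accumulated over all delayed tasks."""
    return sum(nr * tick for nr in nr_delayed_per_tick)


# Hypothetical samples: one delayed task vs. three delayed tasks
# over the same two ticks.
light = [1, 1, 0, 0]
heavy = [3, 3, 0, 0]

# The boolean view cannot tell the two cases apart ...
print(some_time(light), some_time(heavy))
# ... while the summed view can.
print(total_delay(light), total_delay(heavy))
```

Both samples yield the same "some" time, but the summed delay differs by
a factor of three.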
Thanks for the example. So the lost information is kind of a group load.
Exactly.
What meaning does it have when there is no group throttling?
It means how severely this cgroup is interfered with by co-located tasks.
Both psi and run_delay are tracked in (part of) our fleet, and their
spikes usually lead to poor SLI. But we do find circumstances in which
run_delay correlates better with SLI, due to the above-mentioned
method of stall time accounting.
They are treated as indicators for triggering throttling or eviction of
the co-located low-priority jobs.
In fact we also track per-cpu stats (cpu.stat.percpu) for cgroups,
including run_delay, which helps us decide which job should be the
victim and also provides useful info when we diagnose issues.
Honestly, I can reason neither about PSI.some nor about Σ run_delay wrt
feedback for resource control. What is slightly bugging me is the
introduction of another stats field before the first one was explored :-)
But if there's information loss with PSI -- could cpu.pressure:some be
removed in favor of Σ run_delay? (The former could be calculated from
the latter if you're right :-p)
It is not my intent to replace cpu.pressure:some with run_delay. The
former provides a normalized value that can be used to compare among
different cgroups, while the latter cannot.
(I didn't like the before/after shuffling with enum cpu_usage_stat
NR_STATS but I saw v4 where you tackled that.)
Michal
More context from the previous message: the difference is between a) and c),
or better, equal lanes:
a')
t1 |----|
t2 |xx--|
t3 |----|
c)
t1 |----|
t2 |xx--|
t3 |xx--|
<-Δt->
Yes, a) and c) have the same cpu.pressure:some but make different progress.
run_delay can be calculated independently of cpu.pressure:some
because there is still a difference between a') and c) in terms of total
cpu usage.
Δrun_delay = nr * Δt - Δusage
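
As a sanity check, the relation above can be applied to the a') and c)
lanes. This is only a sketch under stated assumptions: each character is
taken as one time unit, '-' as running, 'x' as delayed, and all tasks as
runnable for the whole Δt. The function name lane_stats is made up here.

```python
def lane_stats(lanes):
    """Derive usage, run_delay and 'some' time from ASCII task lanes."""
    dt = len(lanes[0])
    assert all(len(lane) == dt for lane in lanes), "lanes must span the same dt"
    nr = len(lanes)
    # Usage: time units in which a task was actually running.
    usage = sum(lane.count("-") for lane in lanes)
    # run_delay = nr * dt - usage, as in the relation above.
    run_delay = nr * dt - usage
    # 'some': time units in which at least one task was delayed.
    some = sum(1 for column in zip(*lanes) if "x" in column)
    return {"usage": usage, "run_delay": run_delay, "some": some}


A_PRIME = ["----", "xx--", "----"]
C = ["----", "xx--", "xx--"]

print(lane_stats(A_PRIME))
print(lane_stats(C))
```

Under these assumptions both cases have the same "some" time (2 units),
while usage is 10 vs 8 and run_delay is 2 vs 4.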
The challenge is with nr (assuming they're all runnable during Δt), that
would need to be sampled from /sys/kernel/debug/sched/debug. But then
you can get whatever load for individual cfs_rqs from there. Hm, does it
even make sense to add up run_delays from different CPUs?
Very good question. In our case, this summed value is used as a general
indicator to trigger a strategy that in turn depends on the raw per-cpu
data provided by cpu.stat.percpu, which implies that what we actually
want is the per-cpu data.
Thanks & Best Regards,
Abel