Re: [PATCH v2 3/3] cgroup/rstat: Add run_delay accounting for cgroups

On 2/11/25 2:25 AM, Johannes Weiner wrote:
> On Mon, Feb 10, 2025 at 06:20:12AM -1000, Tejun Heo wrote:
>> On Mon, Feb 10, 2025 at 04:38:56PM +0100, Michal Koutný wrote:
>> ...
>>> The challenge is with nr (assuming they're all runnable during Δt), which
>>> would need to be sampled from /sys/kernel/debug/sched/debug. But then you
>>> can get whatever load you want for individual cfs_rqs from there. Hm, does
>>> it even make sense to add up run_delays from different CPUs?

>> The difficulty in aggregating across CPUs is why some and full pressures
>> are defined the way they are. Ideally, we'd want the full distribution of
>> stall states across CPUs, but both aggregation and presentation become
>> challenging, so some/full provide the two extremes. The sum of all
>> cpu_delay adds more incomplete signal on top. I don't know how useful it'd
>> be. At Meta, we depend on PSI a lot when investigating resource problems
>> and we've never felt the need for the summed time, so that's one data
>> point, with the caveat that usually our focus is on mem and io pressures,
>> where the some and full pressure metrics usually seem to provide
>> sufficient information.

>> As the picture provided by the some and full metrics is incomplete, I can
>> imagine adding the sum being useful. That said, it'd help if Abel could
>> provide more concrete examples of it being useful. Another thing to
>> consider is whether we should add this across the resources monitored by
>> PSI - cpu, mem and io.

> Yes, a more detailed description of the usecase would be helpful.

Please check my last reply to see our usecase, and it would be appreciated
if you could shed some light on it.


> I'm not exactly sure how the sum of wait times in a cgroup would be used
> to gauge load without taking available concurrency into account. One
> second of aggregate wait time means something very different with 200
> cpus than with 2.

Indeed. So instead of comparing between different cgroups, we only compare
each cgroup we care about against its own past.
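
For illustration, a rough sketch of how such a counter could be consumed,
assuming it ends up exported as a "run_delay" field in the cgroup's cpu.stat
in nanoseconds (the field name, unit and cgroup path below are placeholders,
not necessarily what this patch exposes):

  import time

  def read_run_delay(cgroup_path):
      # Parse the assumed "run_delay <value>" line from cpu.stat.
      with open(f"{cgroup_path}/cpu.stat") as f:
          for line in f:
              key, _, val = line.partition(" ")
              if key == "run_delay":
                  return int(val)
      return 0

  def run_delay_delta(cgroup_path, interval=10):
      # Delta over 'interval' seconds, to be compared against the same
      # cgroup's own history rather than against other cgroups.
      before = read_run_delay(cgroup_path)
      time.sleep(interval)
      return read_run_delay(cgroup_path) - before

  # Example (hypothetical cgroup path):
  print(run_delay_delta("/sys/fs/cgroup/workload.slice"))

A rising delta relative to the cgroup's own baseline is what we would act
on, not the absolute number.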


> This is precisely what psi tries to capture. "Some" does provide group
> loading information in a sense, but it's a ratio over available
> concurrency, and currently capped at 100%. I.e. if you have N cpus,
> 100% some is "at least N threads waiting at all times." There is a
> gradient below that, but not above.
>
> It's conceivable that percentages over 100% might be useful, to capture
> the degree of contention beyond that, although, like Tejun says, we've
> not felt the need for that so far. Whether something is actionable or
> not tends to be in the 0-1 range, and beyond that it's just "all bad".

PSI is very useful and we heavily rely on it, especially for mem and io.
The reason for adding run_delay is to try to recover the piece lost by
cpu.some when it is below 100%. Please check Michal's example.
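
For reference, what cpu.some gives us today can be read from the cgroup's
cpu.pressure file; a small sketch (the cgroup path is just an example):

  def read_cpu_some(cgroup_path):
      # cpu.pressure lines look like:
      #   some avg10=0.00 avg60=0.00 avg300=0.00 total=0
      #   full avg10=0.00 avg60=0.00 avg300=0.00 total=0
      # The avg* fields are percentages capped at 100; total is the absolute
      # stall time in microseconds, and neither scales with the number of
      # waiting tasks the way a summed run_delay would.
      with open(f"{cgroup_path}/cpu.pressure") as f:
          for line in f:
              fields = line.split()
              if fields and fields[0] == "some":
                  return dict(kv.split("=", 1) for kv in fields[1:])
      return {}

  # Example: {'avg10': '3.12', 'avg60': '2.50', 'avg300': '1.80', 'total': '123456'}
  print(read_cpu_some("/sys/fs/cgroup/workload.slice"))

That per-task wait amount is the gap run_delay is meant to fill.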


> High overload scenarios can also be gauged with tools like runqlat[1],
> which gives a histogram of individual tasks' delays. We've used this one
> extensively to track down issues.
>
> [1] https://github.com/iovisor/bcc/blob/master/tools/runqlat.py

Thanks for mentioning this, but I am not sure whether it is appropriate to
keep it constantly enabled, as we rely on the signal being available in
real time to trigger throttling or eviction of certain jobs.

Thanks & Best Regards,
	Abel




