Re: unexpected CPU pressure measurements when applying cpu.max control

Michael Fitz-Payne <fitzy@xxxxxxxxxxxxx> · Fri, 28 Jun 2024 14:31:00 +1000

Hi Tejun,

Thank you for the response.

On 28/6/24 06:58, Tejun Heo wrote:

>> In short, processes executing within a CPU-limited cgroup are contributing
>> to the system-wide CPU pressure measurement. This results in misleading data
>> that points toward system CPU contention, when no system-wide contention
>> exists.

> This is in line with how PSI aggregation is defined for other resources. It
> doesn't care why the pressure condition exists. e.g. if system.slice is the
> only runnable top-level cgroup and it's thrashing severely due to
> memory.high, the system-level metrics will be reporting full memory
> pressure.

OK, that makes complete sense, and explains why the pressure aggregation
results in these reported measurements.

>> - On 5.10 the 'full' line is not present in either the cgroup cpu.pressure
>> interface or the kernel /proc/pressure/cpu interface. I'm assuming this was
>> added in a newer kernel at some point.

> Yes, because full pressures are defined in terms of CPU cycles that couldn't
> be consumed due to lack of the resource, initially we didn't have a
> definition for CPU full pressure. Later, we used that for measuring cpu.max
> throttling. It makes some sense, but it can also be argued that it's not
> quite the same thing.

That explains what I had observed on 6.8.9 - the cgroup full 
cpu.pressure measurements were what I would have expected from the 
artificially constrained workload.

Whilst it may not be *quite* the same thing (some/full pressure), I
think the outcome is better information than what older kernels provided.
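For monitoring code that has to run on both 5.10 and newer kernels, the
'full' line therefore has to be treated as optional. A minimal Python sketch,
assuming the usual PSI line format (`some avg10=... avg60=... avg300=...
total=...`, with `total` being cumulative stall time in microseconds);
`parse_psi` is just a hypothetical helper name, not a kernel or library API:

```python
from typing import Dict, Optional


def parse_psi(text: str) -> Dict[str, Optional[Dict[str, float]]]:
    """Parse PSI text from /proc/pressure/cpu or a cgroup's cpu.pressure.

    Returns {"some": {...}, "full": {...}}. If the kernel does not emit a
    'full' line (as observed on 5.10 for CPU), "full" is left as None.
    """
    result: Dict[str, Optional[Dict[str, float]]] = {"some": None, "full": None}
    for line in text.strip().splitlines():
        if not line.strip():
            continue  # tolerate stray blank lines
        kind, *fields = line.split()
        if kind not in result:
            continue  # ignore line types we don't know about
        values: Dict[str, float] = {}
        for field in fields:
            key, _, raw = field.partition("=")
            # 'total' is an integer microsecond counter; the averages are percentages
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result
```

On a 5.10 host, `parse_psi(open("/proc/pressure/cpu").read())["full"]` would
then simply be None rather than raising.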

>> As we know, the kernel 'full' measurement is undefined.

> How do you mean?

I was specifically referring to the kernel.org documentation here, which 
states:

"CPU full is undefined at the system level, but has been reported since 
5.13, so it is set to zero for backward compatibility."

Taking the above note into account, and looking at the cgroup cpu.pressure
instead, I think I have a way forward.
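Since system-level CPU 'full' is reported as zero, a system-wide monitor can
only really use the 'some' line of /proc/pressure/cpu. A rough sketch of
turning two samples of the cumulative `total` counter (assumed to be in
microseconds) into a stall percentage over an arbitrary interval; the function
names are hypothetical:

```python
import time


def stall_percent(total_before_us: int, total_after_us: int, interval_s: float) -> float:
    """Share of wall-clock time (in %) during which some task was stalled,
    computed from two samples of a PSI 'total' counter (microseconds)."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    delta_us = total_after_us - total_before_us
    return 100.0 * delta_us / (interval_s * 1_000_000)


def read_some_total(path: str = "/proc/pressure/cpu") -> int:
    """Return the cumulative 'some' stall time (microseconds) from a PSI file."""
    with open(path) as f:
        for line in f:
            if line.startswith("some "):
                for field in line.split()[1:]:
                    key, _, value = field.partition("=")
                    if key == "total":
                        return int(value)
    raise ValueError(f"no 'some total' field in {path}")


def sample_system_cpu_some(interval_s: float = 10.0) -> float:
    """Sample /proc/pressure/cpu twice and return the 'some' stall percentage."""
    t0 = time.monotonic()
    before = read_some_total()
    time.sleep(interval_s)
    after = read_some_total()
    # use the measured elapsed time rather than trusting sleep() exactly
    return stall_percent(before, after, time.monotonic() - t0)
```

Using the raw `total` rather than avg10/avg60 lets the sampling window match
whatever alerting interval is needed.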

> This sounds more like you want to measure local (non-hierarchical) pressure.
> Maybe that makes sense although I'm not sure whether this can be defined
> neatly.

If we were to group CPU-limited (cpu.max) processes into cgroups 
separate from unconstrained processes (e.g. system.slice), we could more 
accurately observe the 'rest of the world' CPU pressure by mostly 
ignoring the cpu.pressure measurements from those cgroups.

This may be enough for what we need - which is really just a strong 
signal of *unexpected* CPU contention.
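Roughly, the 'rest of the world' check could look like the sketch below. It
assumes cgroup v2 mounted at /sys/fs/cgroup, and that the CPU-limited
workloads live in their own top-level cgroups, since pressure from a throttled
child still aggregates into its ancestors. Any top-level cgroup whose cpu.max
quota is not 'max' is skipped; all helper names are hypothetical:

```python
import os
from typing import Dict


def quota_is_limited(cpu_max_text: str) -> bool:
    """cpu.max holds '$MAX $PERIOD'; a $MAX of 'max' means no quota is set."""
    fields = cpu_max_text.split()
    return bool(fields) and fields[0] != "max"


def parse_some_avg10(psi_text: str) -> float:
    """Extract avg10 from the 'some' line of a PSI file."""
    for line in psi_text.splitlines():
        if line.startswith("some "):
            for field in line.split()[1:]:
                key, _, value = field.partition("=")
                if key == "avg10":
                    return float(value)
    raise ValueError("no 'some avg10' field found")


def _read(path: str) -> str:
    with open(path) as f:
        return f.read()


def unconstrained_cpu_pressure(root: str = "/sys/fs/cgroup") -> Dict[str, float]:
    """Map each top-level cgroup without a cpu.max quota to its 'some' avg10."""
    pressures: Dict[str, float] = {}
    for name in sorted(os.listdir(root)):
        child = os.path.join(root, name)
        cpu_pressure = os.path.join(child, "cpu.pressure")
        cpu_max = os.path.join(child, "cpu.max")
        if not os.path.isfile(cpu_pressure):
            continue  # not a cgroup directory, or PSI unavailable
        if os.path.isfile(cpu_max) and quota_is_limited(_read(cpu_max)):
            continue  # deliberately throttled workload: ignore its pressure
        pressures[name] = parse_some_avg10(_read(cpu_pressure))
    return pressures
```

In principle, `unconstrained_cpu_pressure()` should then report values near
zero for slices such as system.slice unless there is genuine contention. If
cpu.max is absent (cpu controller not enabled for that subtree), the cgroup is
treated as unconstrained.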

Thanks