unexpected CPU pressure measurements when applying cpu.max control

Hi there,

We've observed some unexpected CPU pressure measurements via the /proc/pressure/cpu interface when applying the cpu.max control within a cgroup.

In short, processes executing within a CPU-limited cgroup contribute to the system-wide CPU pressure measurement. This produces misleading data that suggests system-wide CPU contention when none exists.


For example: we create a cgroup limited to a single CPU (cpu.max = '100000 100000') and within that cgroup we launch 10 processes contending for that one CPU.

I'm using systemd-run in the commands below for convenience; its CPUQuota property sets the underlying cpu.max cgroup control.

The command that launches the 10 processes is `stress --cpu 10`.
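For reference, here is the mapping I'm assuming between CPUQuota and cpu.max (the 100ms period is systemd's default; I haven't verified this against the systemd source):

```shell
# Sketch (assumption): CPUQuota=N% is written to cpu.max as
# "<quota_us> <period_us>", where the quota is N% of the period.
# With a 100ms period, CPUQuota=100% yields "100000 100000",
# i.e. one CPU's worth of time per period.
period_us=100000
quota_pct=100
quota_us=$(( period_us * quota_pct / 100 ))
printf 'cpu.max: %s %s\n' "$quota_us" "$period_us"
```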

[fitzy@~]$ uname -r
6.8.9-300.fc40.x86_64

Execute the process:

[fitzy@~]$ sudo systemd-run --property CPUQuota=100% --slice example stress --cpu 10
Running as unit: run-rf1c808a9ce1d4e7c82cc57ab90e728e3.service; invocation ID: 67b0808e72364325940cfa898231e83e

Observe the cgroup-specific CPU pressure measurement:

[fitzy@~]$ cat /sys/fs/cgroup/example.slice/run-rf1c808a9ce1d4e7c82cc57ab90e728e3.service/cpu.pressure
some avg10=87.32 avg60=86.44 avg300=56.96 total=272053462
full avg10=87.32 avg60=86.44 avg300=56.96 total=272053075

Compare to the system.slice CPU pressure measurement:

[fitzy@~]$ cat /sys/fs/cgroup/system.slice/cpu.pressure
some avg10=0.00 avg60=0.00 avg300=1.89 total=333141519
full avg10=0.00 avg60=0.00 avg300=1.89 total=332415623

Compare to the system-wide CPU pressure measurement:

[fitzy@~]$ cat /proc/pressure/cpu
some avg10=85.37 avg60=84.94 avg300=65.05 total=1655875251
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
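Our monitoring consumes these readings roughly like the sketch below (the sample line is hardcoded from the output above, and the threshold of 80 is made up for illustration; in practice the line would be read from /proc/pressure/cpu):

```shell
# Extract avg10 from the 'some' line of a PSI reading and compare
# it against an alert threshold.
psi_some='some avg10=85.37 avg60=84.94 avg300=65.05 total=1655875251'
avg10=$(printf '%s\n' "$psi_some" | awk '$1 == "some" { sub(/^avg10=/, "", $2); print $2 }')
threshold=80
alert=$(awk -v v="$avg10" -v t="$threshold" 'BEGIN { print (v+0 > t+0 ? "yes" : "no") }')
echo "avg10=$avg10 alert=$alert"
```

With the cpu.max behaviour described above, this fires even though the contention is confined to one throttled cgroup.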


I've run these tests on a 5.10.0 system as well as on 6.8.9 (above).

There are two differences I can see:

- On 5.10 the 'full' line is not present in either the cgroup cpu.pressure interface or the system-wide /proc/pressure/cpu interface. I'm assuming it was added in a newer kernel at some point.

- On 6.8.9 the 'full' line in the cgroup cpu.pressure interface appears to provide accurate data based on this simple test.

As we know, the system-wide 'full' measurement for CPU is undefined; /proc/pressure/cpu reports it as all zeros.


In either case, the kernel PSI interface is the canonical source we want to read to warn us of CPU contention across our fleet of machines. Due to this unexpected accounting, its values can be misleading.

Frankly, I'm not sure what the behaviour should be. I can see the argument that the current value is correct, given that 'some' is defined as some tasks waiting on CPU.

However, we have no data to fall back on: we cannot use the 'full' measurement from the kernel for CPU pressure, and unless we segregate all CPU-limited processes into their own cgroup slice and read distinct measurements from there, we also cannot rely on reading the per-cgroup cpu.pressure interface.

For now, we are preferring CPU weight controls - which only take effect at saturation - as a compromise. This isn't always the preferred control, because we sometimes want to place a hard cap on CPU-hungry but low-priority processes (e.g. log transformation services).
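To illustrate the compromise with hypothetical numbers (weight 20 for the low-priority group is an example value, not our actual configuration): under cpu.weight, a group's share at saturation is its weight divided by the sum of its runnable siblings' weights, and when the machine is idle the group is not throttled at all, unlike with cpu.max.

```shell
# Hypothetical example: default weight 100 vs a low-prio weight of 20.
# At saturation the low-prio group gets weight / sum(weights) of the CPU.
w_default=100
w_lowprio=20
share=$(awk -v a="$w_lowprio" -v b="$w_default" 'BEGIN { printf "%.1f", 100 * a / (a + b) }')
echo "low-prio share under contention: ${share}%"
```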

Does anyone have advice, or can anyone comment on what the expected behaviour is under these circumstances? Perhaps this is simply working as intended, and we need to make concessions higher up in the stack.

fitzy

---

Michael Fitz-Payne
System Administrator
Civilized Discourse Construction Kit, Inc.




