Hi there,
We've observed some unexpected CPU pressure measurements via the
/proc/pressure/cpu interface when applying the cpu.max control within a
cgroup.
In short, processes executing within a CPU-limited cgroup are
contributing to the system-wide CPU pressure measurement. This results
in misleading data that points toward system CPU contention, when no
system-wide contention exists.
For example: we create a cgroup limited to a single CPU (cpu.max =
'100000 100000'), and within that cgroup we launch 10 processes
contending for that one CPU.
I'm using systemd-run in the commands below for convenience's sake; its
CPUQuota property sets the underlying cpu.max cgroup control.
The command that launches the 10 processes is `stress --cpu 10`.
[fitzy@~]$ uname -r
6.8.9-300.fc40.x86_64
Execute the process:
[fitzy@~]$ sudo systemd-run --property CPUQuota=100% --slice example
stress --cpu 10
Running as unit: run-rf1c808a9ce1d4e7c82cc57ab90e728e3.service;
invocation ID: 67b0808e72364325940cfa898231e83e
Observe the cgroup-specific CPU pressure measurement:
[fitzy@~]$ cat
/sys/fs/cgroup/example.slice/run-rf1c808a9ce1d4e7c82cc57ab90e728e3.service/cpu.pressure
some avg10=87.32 avg60=86.44 avg300=56.96 total=272053462
full avg10=87.32 avg60=86.44 avg300=56.96 total=272053075
Compare to the system.slice CPU pressure measurement:
[fitzy@~]$ cat /sys/fs/cgroup/system.slice/cpu.pressure
some avg10=0.00 avg60=0.00 avg300=1.89 total=333141519
full avg10=0.00 avg60=0.00 avg300=1.89 total=332415623
Compare to the system-wide CPU pressure measurement:
[fitzy@~]$ cat /proc/pressure/cpu
some avg10=85.37 avg60=84.94 avg300=65.05 total=1655875251
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
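(For reference, when we scrape these values for monitoring, the
extraction boils down to something like the sketch below - the function
name and the assumption that avg10 is always the second field of the
'some' line are mine, based on the format shown above.)

```shell
#!/bin/sh
# Sketch: pull the 'some' avg10 value out of a PSI pressure file.
# Works on both /proc/pressure/cpu and a cgroup's cpu.pressure file,
# assuming the "some avg10=... avg60=... avg300=... total=..." layout.
psi_some_avg10() {
    # $1: path to a pressure file, e.g. /proc/pressure/cpu
    awk '/^some/ { sub("avg10=", "", $2); print $2 }' "$1"
}
```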
I've run these tests on a 5.10.0 system as well as on 6.8.9 (above).
There are two differences I can see:
- On 5.10 the 'full' line is not present in either the cgroup
cpu.pressure interface or the kernel /proc/pressure/cpu interface. I'm
assuming this was added in a newer kernel at some point.
- On 6.8.9 the 'full' line in the cgroup cpu.pressure interface appears
to provide accurate data based on this simple test.
As we know, the system-wide 'full' CPU measurement is undefined, and
/proc/pressure/cpu reports it as all zeroes.
In either case, the kernel PSI interface is the canonical source from
which we want to read the measurements for warning us of CPU contention
on our fleet of machines. Due to this unexpected accounting, the values
may be misleading.
Frankly, I'm not sure what the behaviour should be. I can see the
argument that the current value is correct, given that 'some' is defined
as some tasks being stalled waiting on CPU.
However, that leaves us with no signal to fall back on - we cannot use
the 'full' measurement from the kernel for CPU pressure. And unless we
segregate all CPU-limited processes into their own cgroup slice and read
distinct measurements from there, we cannot rely on reading the
per-cgroup cpu.pressure interfaces either.
For now, we are preferring CPU weight controls - which only take effect
at saturation - as a compromise. This isn't always the control we want,
because we sometimes need to place a hard cap on CPU-hungry but
low-priority processes (e.g. log transformation services).
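(Concretely, the weight-based variant looks roughly like this drop-in -
the path and value are illustrative; CPUWeight defaults to 100, so 10
deprioritizes the slice only when there is actual contention:)

```ini
# /etc/systemd/system/example.slice.d/50-cpuweight.conf  (illustrative)
[Slice]
# Relative share that applies only under contention; no hard cap, so no
# artificial PSI 'some' time while the rest of the system is idle.
CPUWeight=10
```

Or equivalently for a one-off run:
sudo systemd-run --property CPUWeight=10 --slice example stress --cpu 10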
Does anyone have advice, or can anyone comment on what the expected
behaviour is under these circumstances? Perhaps this is simply working
as intended, and we need to make concessions higher up the stack.
fitzy
---
Michael Fitz-Payne
System Administrator
Civilized Discourse Construction Kit, Inc.