Hello Yafang.

On Fri, Nov 08, 2024 at 09:29:00PM GMT, Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> After enabling CONFIG_IRQ_TIME_ACCOUNTING to track IRQ pressure in our
> container environment, we encountered several user-visible behavioral
> changes:
>
> - Interrupted IRQ/softirq time is excluded in the cpuacct cgroup
>
>   This breaks userspace applications that rely on CPU usage data from
>   cgroups to monitor CPU pressure. This patchset resolves the issue by
>   ensuring that IRQ/softirq time is included in the cgroup of the
>   interrupted tasks.
>
> - getrusage(2) does not include time interrupted by IRQ/softirq
>
>   Some services use getrusage(2) to check if workloads are experiencing
>   CPU pressure. Since IRQ/softirq time is no longer included in task
>   runtime, getrusage(2) can no longer reflect the CPU pressure caused by
>   heavy interrupts.

I understand that IRQ/softirq time is difficult to attribute to an
"accountable" entity and it's technically simplest to attribute it to
everyone/no one, i.e. to the root cgroup (or through a global stat
without cgroups).

> This patchset addresses the first issue, which is relatively
> straightforward. Once this solution is accepted, I will address the
> second issue in a follow-up patchset.

Is the first issue about cpuacct data or irq.pressure? It sounds like
both, and I noticed the documentation for irq.pressure is lacking in
Documentation/accounting/psi.rst. While you're touching this, could you
please add a paragraph or sentence explaining what this value represents?

(Also, there is the same change both for cpuacct and for
cgroup_base_stat_cputime_show(), right?)

>            ----------------
>            | Load Balancer|
>            ----------------
>            /   |     |   \
>           /    |     |    \
>     Server1 Server2 Server3 ... ServerN
>
> Although the load balancer's algorithm is complex, it follows some core
> principles:
>
> - When server CPU utilization increases, it adds more servers and
>   deploys additional instances to meet SLA requirements.
>
> - When server CPU utilization decreases, it scales down by
>   decommissioning servers and reducing the number of instances to save
>   on costs.

Does a "server" here refer to a whole node (a whole kernel) or to a
cgroup (i.e. multiple servers on top of one kernel)?

> The load balancer is malfunctioning due to the exclusion of IRQ time
> from CPU utilization calculations.

Could this be fixed by subtracting (global) IRQ time from the (presumed
total) system capacity that the balancer uses for its decisions? (i.e.
without an exact per-cgroup breakdown of IRQ time)

Thanks,
Michal
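P.S. To make that last suggestion concrete, here is a rough sketch (in
Python, with a hypothetical helper name and made-up jiffie counts) of
computing utilization against an IRQ-adjusted capacity from the
aggregate "cpu" line of /proc/stat, i.e. without any per-cgroup IRQ
breakdown. The field order follows proc(5): user, nice, system, idle,
iowait, irq, softirq.

```python
def usable_busy_fraction(stat_line):
    """Busy fraction against capacity that excludes IRQ/softirq time.

    `stat_line` is the aggregate "cpu ..." line from /proc/stat; the
    first seven numeric fields are user, nice, system, idle, iowait,
    irq, softirq (in USER_HZ jiffies, per proc(5)).
    """
    fields = [int(x) for x in stat_line.split()[1:8]]
    user, nice, system, idle, iowait, irq, softirq = fields
    total = sum(fields)
    # Treat IRQ/softirq jiffies as capacity lost to interrupts rather
    # than as load attributable to any particular workload.
    capacity = total - irq - softirq
    busy = user + nice + system
    return busy / capacity

# Made-up example counts, for illustration only:
sample = "cpu  5000 100 2000 10000 300 600 400"
print(round(usable_busy_fraction(sample), 3))  # 0.408
```

In other words, the balancer would divide workload busy time by
(total - irq - softirq) instead of by total, which may already correct
its scaling decisions without the exact per-cgroup attribution.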