After enabling CONFIG_IRQ_TIME_ACCOUNTING to track IRQ pressure in our
container environment, we encountered several user-visible behavioral
changes:

- Interrupted IRQ/softirq time is excluded in the cpuacct cgroup

  This breaks userspace applications that rely on CPU usage data from
  cgroups to monitor CPU pressure. This patchset resolves the issue by
  ensuring that IRQ/softirq time is included in the cgroup of the
  interrupted tasks.

- getrusage(2) does not include time interrupted by IRQ/softirq

  Some services use getrusage(2) to check if workloads are experiencing
  CPU pressure. Since IRQ/softirq time is no longer included in task
  runtime, getrusage(2) can no longer reflect the CPU pressure caused by
  heavy interrupts.

This patchset addresses the first issue, which is relatively
straightforward. Once this solution is accepted, I will address the second
issue in a follow-up patchset.

Enabling CONFIG_IRQ_TIME_ACCOUNTING modifies the way CPU utilization is
reported by excluding the time spent handling interrupts (IRQs) from the
CPU usage metric. As a result, we lose visibility into how much time the
CPU was actually interrupted, relative to its total utilization. This can
lead to a misleading interpretation of the CPU's activity, where
interrupted IRQ time is erroneously perceived as sleep time.

  |<----Runtime---->|<----Sleep---->|<----Runtime---->|<---Sleep-->|

When, in reality, it should be:

  |<----Runtime---->|<--Interrupted time-->|<----Runtime---->|<---Sleep-->|

Currently, the only ways to monitor interrupt time are through IRQ PSI or
the IRQ time recorded in delay accounting. However, these metrics are
independent of CPU utilization, which makes it difficult to combine them
into a single, unified measure.

CPU utilization is a critical metric for almost all workloads, and it's
problematic if it fails to reflect the full extent of system pressure.
This situation is similar to iowait: when a task is in iowait, it could be
due to other tasks performing I/O.
It doesn't matter if the I/O is being done by one of your tasks or by
someone else's; what matters is that your task is stalled and waiting on
I/O. Similarly, a comprehensive CPU utilization metric should reflect all
sources of pressure, including IRQ time, to provide a more accurate
representation of workload behavior.

One of the applications impacted by this issue is our Redis load-balancing
service. The setup operates as follows:

                   ----------------
                   | Load Balancer|
                   ----------------
                  /    |      |    \
                 /     |      |     \
          Server1 Server2 Server3 ... ServerN

Although the load balancer's algorithm is complex, it follows some core
principles:

- When server CPU utilization increases, it adds more servers and deploys
  additional instances to meet SLA requirements.
- When server CPU utilization decreases, it scales down by decommissioning
  servers and reducing the number of instances to save on costs.

On our servers, the majority of IRQ/softIRQ activity originates from
network traffic, and we consistently enable Receive Flow Steering
(RFS) [0]. This configuration ensures that softIRQs are more likely to
interrupt the tasks responsible for processing the corresponding packets.
As a result, the distribution of softIRQs is not random but instead
closely aligned with the packet-handling tasks.

The load balancer is malfunctioning due to the exclusion of IRQ time from
CPU utilization calculations. Unfortunately, there is no effective way to
reintegrate IRQ time into CPU utilization metrics using currently
available tools. Consequently, we are left with no choice but to modify
the kernel code to address this issue.
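
To make the effect on a utilization-driven scaler concrete, here is a
minimal userspace sketch in Python. It is a toy model, not code from this
patchset or from our load balancer: the function names, the 0.7 threshold,
and the 500ms/300ms/200ms window are all hypothetical values chosen for
illustration. It only shows how the same window yields two different
utilization figures depending on whether interrupted time is charged to
the cgroup.

```python
def cpu_utilization(usage_start_us, usage_end_us, wall_us, ncpus=1):
    """Utilization in [0, 1] from two cumulative CPU-usage samples.

    The samples stand in for a cumulative per-cgroup usage counter read
    at the start and end of a window (hypothetical input, in microseconds).
    """
    if wall_us <= 0 or ncpus <= 0:
        raise ValueError("wall_us and ncpus must be positive")
    return (usage_end_us - usage_start_us) / (wall_us * ncpus)


def should_scale_up(utilization, threshold=0.7):
    """Hypothetical scaling rule: add servers at or above the threshold."""
    return utilization >= threshold


# Hypothetical 1-second window on one CPU, matching the timeline above:
# the task runs, is interrupted by IRQ/softirq work, then sleeps.
RUNTIME_US = 500_000
IRQ_US = 300_000
SLEEP_US = 200_000
WALL_US = RUNTIME_US + IRQ_US + SLEEP_US

# Interrupted time not charged to the cgroup: it looks like sleep time.
util_without_irq = cpu_utilization(0, RUNTIME_US, WALL_US)

# Interrupted time charged to the interrupted task's cgroup.
util_with_irq = cpu_utilization(0, RUNTIME_US + IRQ_US, WALL_US)

if __name__ == "__main__":
    print("without IRQ time:", util_without_irq,
          "scale up:", should_scale_up(util_without_irq))
    print("with IRQ time:   ", util_with_irq,
          "scale up:", should_scale_up(util_with_irq))
```

In this made-up window the scaler sees a server that looks half idle when
IRQ time is dropped, and so does not add capacity, even though the CPU was
busy for most of the window.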

Link: https://lwn.net/Articles/381955/ [0]

Changes:
v6->v7:
- Fix psi_show() (Michal)

v5->v6: https://lore.kernel.org/all/20241211131729.43996-1-laoar.shao@xxxxxxxxx/
- Return EOPNOTSUPP in psi_show() if irqtime is disabled (Michal)
- Collect Reviewed-by from Michal

v4->v5: https://lore.kernel.org/all/20241108132904.6932-1-laoar.shao@xxxxxxxxx/
- Don't use static key in the IRQ_TIME_ACCOUNTING=n case (Peter)
- Rename psi_irq_time to irq_time (Peter)
- Use CPUTIME_IRQ instead of CPUTIME_SOFTIRQ (Peter)

v3->v4: https://lore.kernel.org/all/20241101031750.1471-1-laoar.shao@xxxxxxxxx/
- Rebase

v2->v3:
- Add a helper account_irqtime() to avoid redundant code (Johannes)

v1->v2: https://lore.kernel.org/cgroups/20241008061951.3980-1-laoar.shao@xxxxxxxxx/
- Fix lockdep issues reported by kernel test robot <oliver.sang@xxxxxxxxx>

v1: https://lore.kernel.org/all/20240923090028.16368-1-laoar.shao@xxxxxxxxx/

Yafang Shao (4):
  sched: Define sched_clock_irqtime as static key
  sched: Don't account irq time if sched_clock_irqtime is disabled
  sched, psi: Don't account irq time if sched_clock_irqtime is disabled
  sched: Fix cgroup irq time for CONFIG_IRQ_TIME_ACCOUNTING

 kernel/sched/core.c    | 77 +++++++++++++++++++++++++++++-------------
 kernel/sched/cputime.c | 16 ++++-----
 kernel/sched/psi.c     | 14 +++-----
 kernel/sched/sched.h   | 15 +++++++-
 kernel/sched/stats.h   |  7 ++--
 5 files changed, 84 insertions(+), 45 deletions(-)

-- 
2.43.5