On Fri, Nov 1, 2024 at 6:28 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Fri, Nov 01, 2024 at 11:17:50AM +0800, Yafang Shao wrote:
> > After enabling CONFIG_IRQ_TIME_ACCOUNTING to monitor IRQ pressure in our
> > container environment, we observed several noticeable behavioral changes.
> >
> > One of our IRQ-heavy services, Redis, reported a significant reduction
> > in CPU usage after upgrading to the new kernel with
> > CONFIG_IRQ_TIME_ACCOUNTING enabled. However, despite adding more threads
> > to handle an increased workload, the CPU usage could not be raised. In
> > other words, even though the container's CPU usage appeared low, it was
> > unable to take on more work and utilize the additional CPU resources,
> > which caused issues.
> >
> > We can verify the CPU usage of the test cgroup using cpuacct.stat. The
> > output shows:
> >
> >   system: 53
> >   user: 2
> >
> > The CPU usage of the cgroup is relatively low at around 55%, but this
> > usage doesn't increase, even with more netperf tasks. The reason is that
> > CPU0 is at 100% utilization, as confirmed by mpstat:
> >
> > 02:56:22 PM  CPU   %usr  %nice   %sys %iowait   %irq  %soft %steal %guest %gnice  %idle
> > 02:56:23 PM    0   0.99   0.00  55.45    0.00   0.99  42.57   0.00   0.00   0.00   0.00
> >
> > 02:56:23 PM  CPU   %usr  %nice   %sys %iowait   %irq  %soft %steal %guest %gnice  %idle
> > 02:56:24 PM    0   2.00   0.00  55.00    0.00   0.00  43.00   0.00   0.00   0.00   0.00
> >
> > It is clear that %soft is not accounted to the cgroup of the interrupted
> > task: the cgroup's system time (~55%) matches %sys alone, while the ~43%
> > of softirq time on CPU0 is invisible to the cgroup. This behavior is
> > unexpected. We should account IRQ time to the cgroup to reflect the
> > pressure the group is under.
> >
> > After a thorough analysis, I discovered that this change in behavior is
> > due to commit 305e6835e055 ("sched: Do not account irq time to current
> > task"), which altered whether IRQ time is charged to the interrupted
> > task. While I agree that a task should not be penalized by random
> > interrupts, the task itself cannot make progress while interrupted.
> > Therefore, the interrupted time should be reported to the user.
> >
> > The system metric in cpuacct.stat is crucial for indicating whether a
> > container is under heavy system pressure, including IRQ/softirq
> > activity. Hence, IRQ/softirq time should be accounted for in the
> > cpuacct system usage, which also applies to cgroup2's rstat.
> >
> > This patch reintroduces IRQ/softirq accounting to cgroups.
>
> How !? What does it actually do?

It seems there's some misunderstanding due to the term *accounting*
here. What the patch actually does is track the time a cgroup spends
interrupted, so that this time becomes visible to the user.
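To make that concrete, here is a minimal userspace sketch (not part of
the patch) of the measurement quoted above: it samples cpuacct.stat
twice, one second apart, and prints the per-cgroup user/system shares.
The cgroup path is a hypothetical example (adjust to your hierarchy),
and it assumes the common USER_HZ of 100, under which a per-second
tick delta reads directly as a percentage of one CPU:

/* irqstat.c - sample a cgroup's cpuacct.stat twice, one second apart */
#include <stdio.h>
#include <unistd.h>

/* cpuacct.stat is two lines: "user <ticks>" then "system <ticks>" */
static int read_stat(const char *path, long *user, long *sys)
{
        FILE *f = fopen(path, "r");

        if (!f)
                return -1;
        if (fscanf(f, "user %ld system %ld", user, sys) != 2) {
                fclose(f);
                return -1;
        }
        fclose(f);
        return 0;
}

int main(void)
{
        /* Hypothetical path; point it at the cgroup under test. */
        const char *path = "/sys/fs/cgroup/cpuacct/test/cpuacct.stat";
        long u0, s0, u1, s1;

        if (read_stat(path, &u0, &s0))
                return 1;
        sleep(1);
        if (read_stat(path, &u1, &s1))
                return 1;
        /* With USER_HZ == 100, ticks per second == percent of one CPU. */
        printf("user: %ld%%  system: %ld%%\n", u1 - u0, s1 - s0);
        return 0;
}

Without the patch, the system value here stays around 55 even while
mpstat shows another ~43% of softirq time on the same CPU; with IRQ
time tracked per cgroup, the two views should converge.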
> > Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> > Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> > ---
> >  kernel/sched/core.c  | 33 +++++++++++++++++++++++++++++++--
> >  kernel/sched/psi.c   | 14 +++-----------
> >  kernel/sched/stats.h |  7 ++++---
> >  3 files changed, 38 insertions(+), 16 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 06a06f0897c3..5ed2c5c8c911 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5579,6 +5579,35 @@ __setup("resched_latency_warn_ms=", setup_resched_latency_warn_ms);
> >  static inline u64 cpu_resched_latency(struct rq *rq) { return 0; }
> >  #endif /* CONFIG_SCHED_DEBUG */
> >
> > +#ifdef CONFIG_IRQ_TIME_ACCOUNTING
> > +static void account_irqtime(struct rq *rq, struct task_struct *curr,
> > +                            struct task_struct *prev)
> > +{
> > +        int cpu = smp_processor_id();
> > +        s64 delta;
> > +        u64 irq;
> > +
> > +        if (!static_branch_likely(&sched_clock_irqtime))
> > +                return;
> > +
> > +        irq = irq_time_read(cpu);
> > +        delta = (s64)(irq - rq->psi_irq_time);
>
> At this point the variable is no longer exclusive to PSI and should
> probably be renamed.

OK.

> > +        if (delta < 0)
> > +                return;
> > +
> > +        rq->psi_irq_time = irq;
> > +        psi_account_irqtime(rq, curr, prev, delta);
> > +        cgroup_account_cputime(curr, delta);
> > +        /* We account both softirq and irq into softirq */
> > +        cgroup_account_cputime_field(curr, CPUTIME_SOFTIRQ, delta);
>
> This seems wrong.. we have CPUTIME_IRQ.

OK.

> > +}
>
> In fact, much of this seems like it's going about things sideways.
>
> Why can't you just add the cgroup_account_*() garbage to
> irqtime_account_irq()? That is where it's still split out into softirq
> and irq.

I previously implemented it that way in v1: link. However, in that
version we had to hold irq_lock on the critical path, which could hurt
performance. Taking inspiration from commit ddae0ca2a8fe ("sched: Move
psi_account_irqtime() out of update_rq_clock_task() hotpath"), I've now
adapted the approach to do the accounting outside the critical path,
reducing the performance impact.

> But the much bigger question is -- how can you be sure that this
> interrupt is in fact for the cgroup you're attributing it to? Could be
> for an entirely different cgroup.

As I explained in another thread, identifying the exact culprit can be
challenging, but identifying the victim is straightforward. That's
precisely what this patch set accomplishes: it charges the interrupted
(victim) cgroup for the time it loses.

--
Regards
Yafang