On Fri, Nov 1, 2024 at 6:28 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Fri, Nov 01, 2024 at 11:17:50AM +0800, Yafang Shao wrote:
> > After enabling CONFIG_IRQ_TIME_ACCOUNTING to monitor IRQ pressure in our
> > container environment, we observed several noticeable behavioral changes.
> >
> > One of our IRQ-heavy services, Redis, reported a significant reduction
> > in CPU usage after upgrading to the new kernel with
> > CONFIG_IRQ_TIME_ACCOUNTING enabled. However, despite adding more threads
> > to handle an increased workload, the CPU usage could not be raised. In
> > other words, even though the container's CPU usage appeared low, it was
> > unable to take on more work and utilize the additional CPU resources,
> > which caused issues.
> >
> > We can verify the CPU usage of the test cgroup using cpuacct.stat. The
> > output shows:
> >
> >   system: 53
> >   user: 2
> >
> > The CPU usage of the cgroup is relatively low at around 55%, but this
> > usage doesn't increase, even with more netperf tasks. The reason is that
> > CPU0 is at 100% utilization, as confirmed by mpstat:
> >
> > 02:56:22 PM  CPU   %usr  %nice   %sys %iowait   %irq  %soft %steal %guest %gnice  %idle
> > 02:56:23 PM    0   0.99   0.00  55.45    0.00   0.99  42.57   0.00   0.00   0.00   0.00
> >
> > 02:56:23 PM  CPU   %usr  %nice   %sys %iowait   %irq  %soft %steal %guest %gnice  %idle
> > 02:56:24 PM    0   2.00   0.00  55.00    0.00   0.00  43.00   0.00   0.00   0.00   0.00
> >
> > It is clear that %soft is not accounted to the cgroup of the interrupted
> > task: the cgroup's system time (~55%) matches %sys alone, while the ~43%
> > of softirq time on CPU0 is invisible to the cgroup. This behavior is
> > unexpected. We should account IRQ time to the cgroup to reflect the
> > pressure the group is under.
> >
> > After a thorough analysis, I discovered that this change in behavior is
> > due to commit 305e6835e055 ("sched: Do not account irq time to current
> > task"), which altered whether IRQ time is charged to the interrupted
> > task. While I agree that a task should not be penalized by random
> > interrupts, the task itself cannot make progress while interrupted.
> > Therefore, the interrupted time should be reported to the user.
> >
> > The system metric in cpuacct.stat is crucial for indicating whether a
> > container is under heavy system pressure, including IRQ/softirq
> > activity. Hence, IRQ/softirq time should be accounted for in the
> > cpuacct system usage, which also applies to cgroup2's rstat.
> >
> > This patch reintroduces IRQ/softirq accounting to cgroups.
>
> How !? What does it actually do?

It seems there's some misunderstanding due to the term *accounting*
here. What the patch actually does is track the time a cgroup spends
interrupted, so that this time becomes visible to the user.
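To make that concrete, here is a minimal userspace sketch (not part of
the patch) of the measurement quoted above: it samples cpuacct.stat
twice, one second apart, and prints the per-cgroup user/system shares.
The cgroup path is a hypothetical example (adjust to your hierarchy),
and it assumes the common USER_HZ of 100, under which a per-second
tick delta reads directly as a percentage of one CPU:

/* irqstat.c - sample a cgroup's cpuacct.stat twice, one second apart */
#include <stdio.h>
#include <unistd.h>

/* cpuacct.stat is two lines: "user <ticks>" then "system <ticks>" */
static int read_stat(const char *path, long *user, long *sys)
{
        FILE *f = fopen(path, "r");

        if (!f)
                return -1;
        if (fscanf(f, "user %ld system %ld", user, sys) != 2) {
                fclose(f);
                return -1;
        }
        fclose(f);
        return 0;
}

int main(void)
{
        /* Hypothetical path; point it at the cgroup under test. */
        const char *path = "/sys/fs/cgroup/cpuacct/test/cpuacct.stat";
        long u0, s0, u1, s1;

        if (read_stat(path, &u0, &s0))
                return 1;
        sleep(1);
        if (read_stat(path, &u1, &s1))
                return 1;
        /* With USER_HZ == 100, ticks per second == percent of one CPU. */
        printf("user: %ld%%  system: %ld%%\n", u1 - u0, s1 - s0);
        return 0;
}

Without the patch, the system value here stays around 55 even while
mpstat shows another ~43% of softirq time on the same CPU; with IRQ
time tracked per cgroup, the two views should converge.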
> > Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> > Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> > ---
> >  kernel/sched/core.c  | 33 +++++++++++++++++++++++++++++++--
> >  kernel/sched/psi.c   | 14 +++-----------
> >  kernel/sched/stats.h |  7 ++++---
> >  3 files changed, 38 insertions(+), 16 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 06a06f0897c3..5ed2c5c8c911 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5579,6 +5579,35 @@ __setup("resched_latency_warn_ms=", setup_resched_latency_warn_ms);
> >  static inline u64 cpu_resched_latency(struct rq *rq) { return 0; }
> >  #endif /* CONFIG_SCHED_DEBUG */
> >
> > +#ifdef CONFIG_IRQ_TIME_ACCOUNTING
> > +static void account_irqtime(struct rq *rq, struct task_struct *curr,
> > +                            struct task_struct *prev)
> > +{
> > +        int cpu = smp_processor_id();
> > +        s64 delta;
> > +        u64 irq;
> > +
> > +        if (!static_branch_likely(&sched_clock_irqtime))
> > +                return;
> > +
> > +        irq = irq_time_read(cpu);
> > +        delta = (s64)(irq - rq->psi_irq_time);
>
> At this point the variable is no longer exclusive to PSI and should
> probably be renamed.

OK.

> > +        if (delta < 0)
> > +                return;
> > +
> > +        rq->psi_irq_time = irq;
> > +        psi_account_irqtime(rq, curr, prev, delta);
> > +        cgroup_account_cputime(curr, delta);
> > +        /* We account both softirq and irq into softirq */
> > +        cgroup_account_cputime_field(curr, CPUTIME_SOFTIRQ, delta);
>
> This seems wrong.. we have CPUTIME_IRQ.

OK.

> > +}
>
> In fact, much of this seems like it's going about things sideways.
>
> Why can't you just add the cgroup_account_*() garbage to
> irqtime_account_irq()? That is where it's still split out into softirq
> and irq.

I previously implemented it that way in v1: link. However, in that
version we had to hold irq_lock on the critical path, which could hurt
performance. Taking inspiration from commit ddae0ca2a8fe ("sched: Move
psi_account_irqtime() out of update_rq_clock_task() hotpath"), I've now
adapted the approach to do the accounting outside the critical path,
reducing the performance impact.

> But the much bigger question is -- how can you be sure that this
> interrupt is in fact for the cgroup you're attributing it to? Could be
> for an entirely different cgroup.

As I explained in another thread, identifying the exact culprit can be
challenging, but identifying the victim is straightforward. That's
precisely what this patch set accomplishes: it charges the interrupted
(victim) cgroup for the time it loses.

--
Regards
Yafang