Re: [PATCH v5 0/4] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled

Yafang Shao <laoar.shao@xxxxxxxxx> · Mon, 18 Nov 2024 20:12:03 +0800

On Mon, Nov 18, 2024 at 6:10 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Sun, Nov 17, 2024 at 10:56:21AM +0800, Yafang Shao wrote:
> > On Fri, Nov 15, 2024 at 9:41 PM Michal Koutný <mkoutny@xxxxxxxx> wrote:
>
> > > > The load balancer is malfunctioning due to the exclusion of IRQ time from
> > > > CPU utilization calculations.
> > >
> > > Could this be fixed by subtracting (global) IRQ time from (presumed
> > > total) system capacity that the balancer uses for its decisions? (i.e.
> > > without exact per-cgroup breakdown of IRQ time)
> >
> > The issue here is that the global IRQ time may include the interrupted
> > time of tasks outside the target cgroup. As a result, I don't believe
> > it's possible to find a reliable solution without modifying the
> > kernel.
>
> Since there is no relation between the interrupt and the interrupted
> task (and through that its cgroup) -- all time might or might not be
> part of your cgroup of interest. Consider it a random distribution if
> you will.
>
> What Michael suggests seems no less fair, and possible more fair than
> what you propose:
>
>  \Sum cgroup = total - IRQ

The key issue here is determining how to reliably get the IRQ. I don't
believe there is a dependable way to achieve this.

For example, consider a server with 16 CPUs. My cgroup contains 4
threads that can freely migrate across CPUs, while other tasks are
also running on the system simultaneously. In this scenario, how can
we accurately determine the IRQ to subtract?

>
> As opposed to what you propose:
>
>  \Sum (cgroup + cgroup-IRQ) = total - remainder-IRQ
>
> Like I argued earlier, if you have two cgroups, one doing a while(1)
> loop (proxy for doing computation) and one cgroup doing heavy IO or
> networking, then per your accounting the computation cgroup will get a
> significant amount of IRQ time 'injected', even though it is effidently
> not of that group.

That is precisely what the user wants. If my tasks are frequently
interrupted by IRQs, it indicates that my service may be experiencing
poor quality. In response, I would likely reduce the traffic sent to
it. If the issue persists and IRQ interruptions remain high, I would
then consider migrating the service to other servers.

>
> Injecting 'half' of the interrupts in the computation group, and missing
> 'half' of the interrupts from the network group will get 'wrong'
> load-balance results too.
>
> I remain unconvinced that any of this makes sense.

If we are uncertain about which choice makes more sense, it might be
better to align this behavior with the case where
CONFIG_IRQ_TIME_ACCOUNTING=n.

-- 
Regards
Yafang