Re: [RFC PATCH v2 3/4] irq: Introduce IRQ_HANDLED_MANY

Leonardo Bras <leobras@xxxxxxxxxx> · Thu, 14 Nov 2024 00:40:17 -0300

On Fri, Feb 23, 2024 at 01:37:39AM -0300, Leonardo Bras wrote:
> On Wed, Feb 21, 2024 at 04:41:20PM +0100, Thomas Gleixner wrote:
> > On Wed, Feb 21 2024 at 02:39, Leonardo Bras wrote:
> > > On Mon, Feb 19, 2024 at 12:03:07PM +0100, Thomas Gleixner wrote:
> > >> >> In scenarios where there is no need to keep track of IRQ handled, convert
> > >> >> it back to IRQ_HANDLED.
> > >> >
> > >> > That's not really workable as you'd have to update tons of drivers just
> > >> > to deal with that corner case. That's error prone and just extra
> > >> > complexity all over the place.
> > >
> > > I agree, that's a downside of this implementation. 
> > 
> > A serious one which is not really workable. See below.
> > 
> > > I agree the above may be able to solve the issue, but it would make 2 extra 
> > > atomic ops necessary in the thread handling the IRQ, as well as one extra 
> > > atomic operation in note_interrupt(), which could increase latency on this 
> > > IRQ deferring the handler to a thread.
> > >
> > > I mean, yes, the cpu running note_interrupt() would probably already have
> > > exclusiveness for this cacheline, but it further increases cacheline
> > > bouncing and also adds the memory barriers that atomic operations incur,
> > > even if we use an extra bit from threads_handled instead of allocating a
> > > new field for threads_running.
> > 
> > I think that's a strawman. Atomic operations can of course be more
> > expensive than non-atomic ones, but they only start to make a difference
> > when the cache line is contended. That's not the case here for the
> > normal operations.
> > 
> > Interrupts and their threads are strictly targeted to a single CPU and
> > the cache line is already hot and had to be made exclusive because of
> > other write operations to it.
> > 
> > There is usually no concurrency at all, except for administrative
> > operations like enable/disable or affinity changes. Those administrative
> > operations are not high frequency and the resulting cache line bouncing
> > is unavoidable even without that change. But does it matter in the
> > larger picture? I don't think so.
> 
> That's a fair point, but there are some use cases that use CPU Isolation on 
> top of PREEMPT_RT in order to reduce interference on a CPU running an RT 
> workload.
> 
> For those cases, IIRC the handler will run on a different (housekeeping) 
> CPU when those IRQs originate on an Isolated CPU, meaning the above 
> described cacheline bouncing is expected.
> 
> 
> > 
> > > On top of that, let's think of a scenario where the threaded handler will
> > > handle a lot of requests, but not necessarily spend a lot of time doing so.
> > > This allows the thread to run for little time while handling a lot of
> > > requests.
> > >
> > > In this scenario, note_interrupt() could return without incrementing
> > > irqs_unhandled for those IRQs that happen while the brief thread is running,
> > > but every other IRQ would cause note_interrupt() to increase
> > > irqs_unhandled, which would cause the bug to still reproduce.
> > 
> > In theory yes. Does it happen in practice?
> > 
> > But that exposes a flaw in the actual detection code. The code is
> > unconditionally accumulating if there is an unhandled interrupt within
> > 100ms after the last unhandled one. IOW, if there is a periodic
> > unhandled one every 50ms, the interrupt will be shut down after 100000 *
> > 50ms = 5000s ~= 83.3m ~= 1.4h. And it does not care about the number of
> > actually handled interrupts either.
> > 
> > The spurious detector is really about runaway interrupts which hog a CPU
> > completely, but the above is not what we want to protect against.
> 
> Now it makes a lot more sense to me.
> Thanks!

Hi Thomas,

I would like to go back to this discussion :)
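
To recap the arithmetic quoted above, here is a minimal userspace sketch in C
of the accumulation rule as Thomas describes it: an unhandled interrupt bumps
the counter if it arrives within 100ms of the previous unhandled one,
otherwise the counter restarts. This is not the kernel's note_interrupt();
struct detector, note_unhandled(), ms_until_shutdown() and the two constants
are hypothetical names for illustration only, and the 100000 limit is taken
from the calculation quoted above.

```c
#include <stdint.h>

/* Assumed values, taken from the quoted description (not from kernel source). */
#define WINDOW_MS          100u     /* accumulate if within 100ms of last unhandled */
#define SHUTDOWN_THRESHOLD 100000u  /* accumulated unhandled IRQs before shutdown */

/* Hypothetical, simplified stand-in for the per-IRQ bookkeeping. */
struct detector {
	uint64_t last_unhandled_ms;
	uint32_t irqs_unhandled;
};

/*
 * Record one unhandled interrupt at time now_ms.
 * Returns 1 once the accumulated count reaches the threshold.
 * Note: handled interrupts are never looked at, which is the flaw
 * pointed out in the quoted text.
 */
static int note_unhandled(struct detector *d, uint64_t now_ms)
{
	if (d->irqs_unhandled != 0 &&
	    now_ms - d->last_unhandled_ms <= WINDOW_MS)
		d->irqs_unhandled++;      /* close to the previous one: accumulate */
	else
		d->irqs_unhandled = 1;    /* too far apart: start over */

	d->last_unhandled_ms = now_ms;
	return d->irqs_unhandled >= SHUTDOWN_THRESHOLD;
}

/*
 * Simulate one unhandled interrupt every period_ms (must be > 0).
 * Returns the time in ms at which the threshold is hit,
 * or 0 if it is not hit before limit_ms.
 */
static uint64_t ms_until_shutdown(uint64_t period_ms, uint64_t limit_ms)
{
	struct detector d = { 0, 0 };
	uint64_t t;

	for (t = period_ms; t <= limit_ms; t += period_ms)
		if (note_unhandled(&d, t))
			return t;
	return 0;
}
```

Under these assumptions, ms_until_shutdown(50, 10000000) gives 5000000, i.e.
the 5000s from the quoted calculation, no matter how many interrupts were
handled in between. With a 150ms period the counter restarts every time and
the function returns 0.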