On Wed, Jan 17, 2024 at 11:08:44PM +0100, Thomas Gleixner wrote: > On Tue, Jan 16 2024 at 04:36, Leonardo Bras wrote: > > This IRQ line disable bug can be easily reproduced with a serial8250 > > console on a PREEMPT_RT kernel: it only takes the user to print a lot > > of text to the console (or to ttyS0): around 300k chars should be > > enough. > > That has nothing to do with RT, it's a problem of force threaded > interrupts in combination with an edge type interrupt line and a > hardware which keeps firing interrupts forever. Hello Thomas, thanks for your feedback! I agreed it has nothing to do with RT. I just mentioned PREEMPT_RT as my test case scenario, since it enables force-threaded IRQs. > > > To fix this bug, reset irqs_unhandled whenever irq_thread handles at least > > one IRQ request. > > This papers over the symptom and makes runaway detection way weaker for > all interrupts or breaks it completely. This change is supposed to only touch threaded interruptions, since it will reach the included line only if (action_ret == IRQ_WAKE_THREAD) and if desc->threads_handled changes since the last IRQ request. This incrementing also happens only on irq_forced_thread_fn() and irq_thread_fn(), which are called only from irq_thread_fn(). But I get the overall worry about having this making runaway detection way weaker for all threaded interrupts. I have previously worked on a solution that can be more precise and be an opt-in for drivers instead of a general solution: It required a change in IRQ interface that let the handlers inform how many IRQs were actually handled (batching). This number would then be added to desc->threads_handle (in irq_*thread_fn(), just changing the atomic_inc() to atomic_add()), and then subtracted from irqs_unhandled at note_interrupt(). In the serial8250 case, the driver would be changed to use that interface, since it's already able to process multiple IRQs, and the bug just vanishes. This also solved the serial driver issue, but required a deeper change in the code, which caused me to consider a simpler solution first. This solution sure does give better runnaway detection. Do you think it would be better that the one I sent in this patch? > > The problem with edge type interrupts is that we cannot mask them like > we do with level type interrupts in the hard interrupt handler and > unmask them once the threaded handler finishes. > > So yes, we need special rules here when: > > 1) The interrupt handler is force threaded > > 2) The interrupt line is edge type > > 3) The accumulated unhandled interrupts are within a sane margin > > Thanks, > > tglx > Completelly agree, that's why I am suggesting dealing with threaded interruptions in a different way: reseting the unhandled count when it handles a request. I am not sure how force threaded and just threaded are different in this scenario. Could you help me understand? Thanks! Leo