On Thu, Nov 14, 2024 at 12:40:17AM -0300, Leonardo Bras wrote: > On Fri, Feb 23, 2024 at 01:37:39AM -0300, Leonardo Bras wrote: > > On Wed, Feb 21, 2024 at 04:41:20PM +0100, Thomas Gleixner wrote: > > > On Wed, Feb 21 2024 at 02:39, Leonardo Bras wrote: > > > > On Mon, Feb 19, 2024 at 12:03:07PM +0100, Thomas Gleixner wrote: > > > >> >> Is scenarios where there is no need to keep track of IRQ handled, convert > > > >> >> it back to IRQ_HANDLED. > > > >> > > > > >> > That's not really workable as you'd have to update tons of drivers just > > > >> > to deal with that corner case. That's error prone and just extra > > > >> > complexity all over the place. > > > > > > > > I agree, that's a downside of this implementation. > > > > > > A serious one which is not really workable. See below. > > > > > > > I agree the above may be able to solve the issue, but it would make 2 extra > > > > atomic ops necessary in the thread handling the IRQ, as well as one extra > > > > atomic operation in note_interrupt(), which could increase latency on this > > > > IRQ deferring the handler to a thread. > > > > > > > > I mean, yes, the cpu running note_interrupt() would probably already have > > > > exclusiveness for this cacheline, but it further increases cacheline > > > > bouncing and also adds the mem barriers that incur on atomic operations, > > > > even if we use an extra bit from threads_handled instead of allocate a new > > > > field for threads_running. > > > > > > I think that's a strawman. Atomic operations can of course be more > > > expensive than non-atomic ones, but they only start to make a difference > > > when the cache line is contended. That's not the case here for the > > > normal operations. > > > > > > Interrupts and their threads are strictly targeted to a single CPU and > > > the cache line is already hot and had to be made exclusive because of > > > other write operations to it. > > > > > > There is usually no concurrency at all, except for administrative > > > operations like enable/disable or affinity changes. Those administrative > > > operations are not high frequency and the resulting cache line bouncing > > > is unavoidable even without that change. But does it matter in the > > > larger picture? I don't think so. > > > > That's a fair point, but there are some use cases that use CPU Isolation on > > top of PREEMPT_RT in order to reduce interference on a CPU running an RT > > workload. > > > > For those cases, IIRC the handler will run on a different (housekeeping) > > CPU when those IRQs originate on an Isolated CPU, meaning the above > > described cacheline bouncing is expected. > > > > > > > > > > > On top of that, let's think on a scenario where the threaded handler will > > > > solve a lot of requests, but not necessarily spend a lot of time doing so. > > > > This allows the thread to run for little time while solving a lot of > > > > requests. > > > > > > > > In this scenario, note_interrupt() could return without incrementing > > > > irqs_unhandled for those IRQ that happen while the brief thread is running, > > > > but every other IRQ would cause note_interrupt() to increase > > > > irqs_unhandled, which would cause the bug to still reproduce. > > > > > > In theory yes. Does it happen in practice? > > > > > > But that exposes a flaw in the actual detection code. The code is > > > unconditionally accumulating if there is an unhandled interrupt within > > > 100ms after the last unhandled one. IOW, if there is a periodic > > > unhandled one every 50ms, the interrupt will be shut down after 100000 * > > > 50ms = 5000s ~= 83.3m ~= 1.4h. And it neither cares about the number of > > > actually handled interrupts. > > > > > > The spurious detector is really about runaway interrupts which hog a CPU > > > completely, but the above is not what we want to protect against. > > > > Now it makes a lot more sense to me. > > Thanks! > > Hi Thomas, > > I would like to go back to this discussion :) > From what I could understand, and read back the thread: > > - The spurious detector is used to avoid cpu hog when a lots of IRQs are > hitting a cpu, but few ( < 100 / 100k) are being handled. It works by > disabling that interruption. > > - The bug I am dealing with (on serial8250), happens to fit exactly at > above case: lots of requests, but few are handled. > The reason: threaded handler, many requests, and they are dealt with in > batch: multiple requests are handled at once, but a single IRQ_HANDLED > returned. > > - My proposed solution: Find a way of accounting the requests handled. > > - Implementation: add an option for drivers voluntarily report how > many requests they handled. Current drivers need no change. > - Limitation: If this issue is found on another driver, we need to > implement accounting there as well. This may only happen on drivers > which handle over 1k requests at once. > What was left for me TODO: > Think on a generic solution for this issue, to avoid dealing with that > in a per-driver basis. > > That's what I was able to think about: > - Only the driver code knows how many requests it handled, so without > touching them we can't know how many requests were properly handled. Hmm... But do I understand correctly the following: - the IRQ core knows the amount of generated IRQs for the device (so it's kinda obvious that IRQ number maps to the driver); - the IRQ core disables IRQ while handling an IRQ number in question; - the driver is supposed to handle all IRQs that were reported at the beginning o.f its handler; - taking the above the amount of handled IRQs is what came till the disabling IRQ. IRQs that came after should be replayed when IRQ gets enabled. ? > - I could try thinking a different solution, which involves changing only > the spurious detector. > > - For that I would need to find a particular characteristic we would want > to avoid spurious detection against, and make sure it won't miss an > actual case we want to be protected about. > > Generic solutions(?) proposed: > - Zero irqs_unhandled if threaded & handles a single request in 100k > - Problem: A regular issue with the interruption would not be detected > in the driver. > > - Skip detection if threaded & the handling thread is running > - Problem 1: the thread may run shortly and batch handle a lot of stuff, > not being detected by the spurious detector. > - Problem 2: the thread may get stuck, not handle the IRQs and also not > being detected by the spurious handler. (IIUC) > > > In the end, I could not find a proper way of telling apart > a - "this is a real spurious IRQ behavior, which needs to be disabled", and > b - "this is just a handler that batch-handles it's requests", > without touching the drivers' code. > > Do you have any suggestion on how to do that? -- With Best Regards, Andy Shevchenko