On Mon, Oct 26 2020 at 17:28, Guilherme Piccoli wrote:
> On Mon, Oct 26, 2020 at 4:59 PM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>> It gets flooded right at the point where the crash kernel enables
>> interrupts in start_kernel(). At that point there is no device driver
>> and no interrupt requested. All you can see on the console for this is
>>
>>   "common_interrupt: $VECTOR.$CPU No irq handler for vector"
>>
>> And contrary to Liu's patches which try to disable a requested interrupt
>> if too many of them arrive, the kernel cannot do anything because there
>> is nothing to disable in your case. That's why you needed to do the MSI
>> disable magic in the early PCI quirks which run before interrupts get
>> enabled.
>
> Wow, thank you very much for this great explanation (without a
> reproducer) - it's nice to hear somebody that deeply understands the
> code! And double thanks for CCing Bjorn.

Understanding the code is only half of the picture. You need to
understand how the hardware works or not :)

> So, I don't want to hijack Liu's thread, but do you think it makes
> sense to have my approach as a (debug) parameter to prevent such a
> degenerate case?

At least it makes sense to some extent even if it's incomplete. What
bothers me is that it'd be x86 specific while the issue is pretty much
architecture independent. I don't think that the APIC is special in that
regard. Rogue MSIs should be able to bring down pretty much all
architectures.

> Or could we have something in core IRQ code to prevent irq flooding in
> such scenarios, something "stronger" than disabling MSIs (APIC-level,
> likely)?

For your case? No. The APIC cannot be protected against rogue MSIs. The
only cure is to disable interrupts or disable MSIs on all PCI[E] devices
early on. Disabling interrupts is not so much of an option obviously :)

Thanks,

        tglx
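
For readers unfamiliar with what "disable MSIs on all PCI[E] devices"
means at the register level, here is a minimal user-space sketch. It is
not the early quirk discussed in this thread and not kernel code: the
struct fake_pci_dev type and the fake_*() and cfg_*() helpers are
hypothetical and only simulate one device's 256-byte config space in
memory. The offsets and bit masks are the standard PCI ones (Status
register, Capabilities Pointer, MSI and MSI-X Message Control). The
sketch walks the capability list and clears the MSI and MSI-X enable
bits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Standard PCI config space offsets and bits. */
#define PCI_STATUS            0x06
#define PCI_STATUS_CAP_LIST   0x10    /* device has a capability list */
#define PCI_CAPABILITY_LIST   0x34    /* pointer to first capability */
#define PCI_CAP_ID_MSI        0x05
#define PCI_CAP_ID_MSIX       0x11
#define PCI_MSI_FLAGS         0x02    /* Message Control, relative to cap */
#define PCI_MSI_FLAGS_ENABLE  0x0001
#define PCI_MSIX_FLAGS_ENABLE 0x8000

#define CFG_SIZE 256

/* Hypothetical stand-in for one device's config space. */
struct fake_pci_dev {
	uint8_t cfg[CFG_SIZE];
};

/* Little-endian 16-bit accessors on the simulated config space. */
uint16_t cfg_read16(const struct fake_pci_dev *d, unsigned int off)
{
	assert(off + 1 < CFG_SIZE);
	return (uint16_t)(d->cfg[off] | (d->cfg[off + 1] << 8));
}

void cfg_write16(struct fake_pci_dev *d, unsigned int off, uint16_t val)
{
	assert(off + 1 < CFG_SIZE);
	d->cfg[off] = (uint8_t)(val & 0xff);
	d->cfg[off + 1] = (uint8_t)(val >> 8);
}

/* Build a fake device with MSI at 0x50 and MSI-X at 0x70, both enabled. */
void fake_dev_init(struct fake_pci_dev *d)
{
	memset(d, 0, sizeof(*d));
	cfg_write16(d, PCI_STATUS, PCI_STATUS_CAP_LIST);
	d->cfg[PCI_CAPABILITY_LIST] = 0x50;

	d->cfg[0x50] = PCI_CAP_ID_MSI;
	d->cfg[0x51] = 0x70;                    /* next capability */
	cfg_write16(d, 0x50 + PCI_MSI_FLAGS, PCI_MSI_FLAGS_ENABLE);

	d->cfg[0x70] = PCI_CAP_ID_MSIX;
	d->cfg[0x71] = 0x00;                    /* end of list */
	cfg_write16(d, 0x70 + PCI_MSI_FLAGS, PCI_MSIX_FLAGS_ENABLE);
}

/*
 * Walk the capability list and clear the MSI / MSI-X enable bits.
 * Returns the number of capabilities that were switched off.
 */
int fake_disable_msi(struct fake_pci_dev *d)
{
	int disabled = 0;
	int ttl = 48;           /* bound the walk in case the list loops */
	uint8_t pos;

	if (!(cfg_read16(d, PCI_STATUS) & PCI_STATUS_CAP_LIST))
		return 0;

	pos = (uint8_t)(d->cfg[PCI_CAPABILITY_LIST] & 0xfc);
	while (pos >= 0x40 && ttl-- > 0) {
		uint8_t id = d->cfg[pos];
		uint16_t mask = 0;

		if (id == PCI_CAP_ID_MSI)
			mask = PCI_MSI_FLAGS_ENABLE;
		else if (id == PCI_CAP_ID_MSIX)
			mask = PCI_MSIX_FLAGS_ENABLE;

		if (mask) {
			uint16_t ctrl = cfg_read16(d, pos + PCI_MSI_FLAGS);

			if (ctrl & mask) {
				cfg_write16(d, pos + PCI_MSI_FLAGS,
					    (uint16_t)(ctrl & ~mask));
				disabled++;
			}
		}
		pos = (uint8_t)(d->cfg[pos + 1] & 0xfc);
	}
	return disabled;
}
```

To try it, call fake_dev_init() on a struct fake_pci_dev and then
fake_disable_msi() on the same device. A real early quirk would have to
do the equivalent through the kernel's config space accessors for every
device on every bus, before interrupts get enabled.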