On Mon, Oct 26, 2020 at 4:59 PM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
> On Mon, Oct 26 2020 at 12:06, Guilherme Piccoli wrote:
> > On Sun, Oct 25, 2020 at 8:12 AM Pingfan Liu <kernelfans@xxxxxxxxx> wrote:
> >
> > Some time ago (2 years) we faced a similar issue in x86-64, a hard to
> > debug problem in kdump, that eventually was narrowed to a buggy NIC FW
> > flooding IRQs in kdump kernel, and no messages showed (although kernel
> > changed a lot since that time, today we might have better IRQ
> > handling/warning). We tried an early-boot fix, by disabling MSIs (as
> > per PCI spec) early in x86 boot, but it wasn't accepted - Bjorn asked
> > pertinent questions that I couldn't answer (I lost the reproducer)
> > [0].
> ...
> > [0] lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@xxxxxxxxxxxxx
>
> With that broken firmware the NIC continued to send MSI messages to the
> vector/CPU which was assigned to it before the crash. But the crash
> kernel has no interrupt descriptor for this vector installed. So Liu's
> patches won't print anything simply because the interrupt core cannot
> detect it.
>
> To answer Bjorn's still open question about when the point X is:
>
>   https://lore.kernel.org/linux-pci/20181023170343.GA4587@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
>
> It gets flooded right at the point where the crash kernel enables
> interrupts in start_kernel(). At that point there is no device driver
> and no interrupt requested. All you can see on the console for this is
>
>   "common_interrupt: $VECTOR.$CPU No irq handler for vector"
>
> And contrary to Liu's patches which try to disable a requested interrupt
> if too many of them arrive, the kernel cannot do anything because there
> is nothing to disable in your case. That's why you needed to do the MSI
> disable magic in the early PCI quirks which run before interrupts get
> enabled.
>
> Also Liu's patch only works if:
>
>   1) CONFIG_IRQ_TIME_ACCOUNTING is enabled
>
>   2) the runaway interrupt has been requested by the relevant driver in
>      the dump kernel.
>
> Especially #1 is not a sensible restriction.
>
> Thanks,
>
>         tglx

Wow, thank you very much for this great explanation (without a
reproducer) - it's nice to hear from somebody who deeply understands the
code! And double thanks for CCing Bjorn.

So, I don't want to hijack Liu's thread, but do you think it makes
sense to have my approach as a (debug) parameter to prevent such a
degenerate case? Or could we have something in core IRQ code to
prevent IRQ flooding in such scenarios, something "stronger" than
disabling MSIs (APIC-level, likely)?

Cheers,

Guilherme
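
For reference, "disabling MSIs (as per PCI spec)" in the discussion above
comes down to clearing the MSI Enable bit (bit 0) of the Message Control
register in the device's MSI capability. The sketch below only illustrates
that mechanism as a userspace simulation over a fake config-space buffer. It
is not the patch from [0] and not kernel code. The constant names mirror
those in Linux's pci_regs.h; make_fake_device() and the capability offset
0x50 are assumptions made up for the example.

```python
import struct

PCI_STATUS = 0x06            # 16-bit Status register
PCI_STATUS_CAP_LIST = 0x10   # Status bit 4: capability list present
PCI_CAPABILITY_LIST = 0x34   # 8-bit pointer to the first capability
PCI_CAP_ID_MSI = 0x05        # capability ID for MSI
PCI_MSI_FLAGS = 2            # Message Control, offset within the capability
PCI_MSI_FLAGS_ENABLE = 0x0001


def find_capability(cfg, cap_id):
    """Return the config-space offset of capability cap_id, or None."""
    (status,) = struct.unpack_from("<H", cfg, PCI_STATUS)
    if not status & PCI_STATUS_CAP_LIST:
        return None
    pos = cfg[PCI_CAPABILITY_LIST] & 0xFC
    for _ in range(48):  # bound the walk in case a broken device loops the list
        if pos < 0x40:   # pointers below 0x40 (including 0) end the list
            return None
        if cfg[pos] == cap_id:
            return pos
        pos = cfg[pos + 1] & 0xFC
    return None


def disable_msi(cfg):
    """Clear MSI Enable; return True if the bit was set beforehand."""
    pos = find_capability(cfg, PCI_CAP_ID_MSI)
    if pos is None:
        return False
    (flags,) = struct.unpack_from("<H", cfg, pos + PCI_MSI_FLAGS)
    cleared = flags & ~PCI_MSI_FLAGS_ENABLE & 0xFFFF
    struct.pack_into("<H", cfg, pos + PCI_MSI_FLAGS, cleared)
    return bool(flags & PCI_MSI_FLAGS_ENABLE)


def make_fake_device():
    """Hypothetical 256-byte config space with one MSI capability at 0x50."""
    cfg = bytearray(256)
    struct.pack_into("<H", cfg, PCI_STATUS, PCI_STATUS_CAP_LIST)
    cfg[PCI_CAPABILITY_LIST] = 0x50
    cfg[0x50] = PCI_CAP_ID_MSI
    cfg[0x51] = 0x00                           # next pointer: end of list
    struct.pack_into("<H", cfg, 0x52, 0x0081)  # 64-bit capable, MSI enabled
    return cfg
```

Calling disable_msi(make_fake_device()) clears only bit 0 and leaves the
other Message Control bits untouched. In the kdump case described above, the
point is that this has to happen before interrupts are enabled in
start_kernel(), since no driver exists yet to quiesce the device.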