On Wed, Feb 05, 2025 at 02:31:56PM +0800, Feng Tang wrote:
> On Tue, Feb 04, 2025 at 10:14:10AM +0100, Lukas Wunner wrote:
> > On Tue, Feb 04, 2025 at 01:37:58PM +0800, Feng Tang wrote:
> > > There was an irq storm bug when testing "pci=nomsi" case, and the root
> > > cause is: 'nomsi' will disable MSI and let devices and root ports use
> > > legacy INTX interrupt, and likely make several devices/ports share one
> > > interrupt. In the failure case, BIOS doesn't disable the PCIE hotplug
> > > interrupts, and actually asserts the command-complete interrupt.
> > > As MSI is disabled, ACPI initialization code will not enumerate root
> > > port's PCIE hotplug capability, and pciehp service driver won't be
> > > enabled for the root port to handle that interrupt, later on when it is
> > > shared and enabled by other device driver like NVME or NIC, the "nobody
> > > care irq storm" happens.
> > >
> > > So disable the pcie hotplug CCIE/HPIE interrupt in early boot phase when
> > > MSI is not enabled.
> >
> > So I think this issue should go away if disabling the interrupt
> > by portdrv is no longer conditional on
> >
> >   (pcie_ports_native || host->native_pcie_hotplug)
> >
> > like I've just proposed here:
> >
> >   https://lore.kernel.org/r/Z6HYuBDP6uvE1Sf4@xxxxxxxxx/
> >
> > ... in which case this patch won't be necessary.  Can you confirm that?
>
> Thanks for the suggestion! I will try to get the platform for test,
> and report back.

I haven't got the platform yet, but I recalled something: the disabling
of HP interrupts inside get_port_device_capability()/portdrv_probe() was
called after nvme_probe(), so it may still cause the irq storm, because:

  * the PCIe root port's hotplug interrupt is asserted
  * the interrupt is shared with NVMe and other devices
  * those device drivers enable the interrupt line early, before
    portdrv's probe()

That's why we tried to put the disabling early in the PCI initialization
code.

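For illustration only, the register manipulation under discussion can be
sketched in plain userspace C. This is not the actual patch: the two
helper names are hypothetical, and only the bit values are taken from the
kernel's include/uapi/linux/pci_regs.h. It shows what clearing CCIE/HPIE
in a Slot Control value amounts to, independent of when in the boot
sequence it happens.

```c
#include <stdint.h>

/* Bit values as defined in include/uapi/linux/pci_regs.h */
#define PCI_EXP_SLTCTL_CCIE	0x0010	/* Command Completed Interrupt Enable */
#define PCI_EXP_SLTCTL_HPIE	0x0020	/* Hot-Plug Interrupt Enable */

/*
 * Hypothetical helper (not a kernel function): return the Slot Control
 * value with both hotplug interrupt enables cleared, leaving every
 * other bit untouched.
 */
static uint16_t sltctl_mask_hotplug_irqs(uint16_t sltctl)
{
	return sltctl & (uint16_t)~(PCI_EXP_SLTCTL_CCIE | PCI_EXP_SLTCTL_HPIE);
}

/*
 * Hypothetical helper: return 1 if either enable bit is still set,
 * i.e. the port could still assert a hotplug interrupt on a shared line.
 */
static int sltctl_hotplug_irqs_enabled(uint16_t sltctl)
{
	return (sltctl & (PCI_EXP_SLTCTL_CCIE | PCI_EXP_SLTCTL_HPIE)) != 0;
}
```

In the kernel the equivalent read-modify-write is done by
pcie_capability_clear_word() on PCI_EXP_SLTCTL, as in the quoted hunk
below; the open question in this thread is only how early that has to
run relative to the drivers that share the INTx line.
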
Thanks,
Feng

> As for the change,
> +	if (!IS_ENABLED(CONFIG_HOTPLUG_PCI_PCIE))
> +		pcie_capability_clear_word(dev, PCI_EXP_SLTCTL,
> +			PCI_EXP_SLTCTL_CCIE | PCI_EXP_SLTCTL_HPIE);
>
> CONFIG_HOTPLUG_PCI_PCIE is always enabled on our platform and in many
> distros, so I guess the check needs to be removed, which brings back the
> 1 second wait and would need the waiting logic from patch 1/2?
>
> Thanks,
> Feng