On Mon, Apr 13, 2015 at 11:45:31AM -0500, Eric W. Biederman wrote:
> Bjorn Helgaas <bhelgaas@xxxxxxxxxx> writes:
>
> > On Mon, Apr 13, 2015 at 4:37 AM, Fam Zheng <famz@xxxxxxxxxx> wrote:
> >> Hi Bjorn,
> >>
> >> On Fri, 04/10 17:54, Bjorn Helgaas wrote:
> >>> From: Michael S. Tsirkin <mst@xxxxxxxxxx>
> >>>
> >>> d52877c7b1af ("pci/irq: let pci_device_shutdown to call pci_msi_shutdown
> >>> v2") disabled MSI/MSI-X at device shutdown to address a kexec problem.
> >>>
> >>> The problem is that after we disable MSI, the device may assert INTx, and
> >>> if the driver hasn't registered an interrupt handler for it, the interrupt
> >>> is never deasserted and causes a kernel hang.  In particular, this was
> >>> observed with virtio.
> >>>
> >>> We now disable MSI/MSI-X for all devices during enumeration regardless of
> >>> CONFIG_PCI_MSI.  This solves the kexec problem in the new kernel, not the
> >>> old one.
> >>>
> >>> Stop disabling MSIs at shutdown to avoid the kernel hang.
> >>>
> >>> XXX bugzilla reference, details about how the hang happens?
> >>
> >> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=96571
> >>
> >> Please let me know if you need any further information in the bug.
> >
> > Please attach a complete dmesg log.  The bugzilla doesn't really have
> > any new information other than that you see a soft lockup.  I'm trying
> > to connect more of the dots between a spurious interrupt and a hang or
> > soft lockup.
> >
>
> The bugzilla implies that there is a screaming irq (which causes the
> softlockup when they disable the kernel's protections for buggy irqs).
>
> > It doesn't seem right that a spurious interrupt could cause a hang or
> > soft lockup.
>
> The interrupt handler keeps firing.
>
> > I would think Linux would emit a message about the
> > unexpected interrupt, but would otherwise be relatively unconcerned.
>
> That was disabled on the kernel command line.
>
> > So I'm trying to figure out why my assumption is wrong.  Probably this
> > is just because I don't know much about Linux IRQ handling.
> >
> > Having more details, e.g., a stacktrace fragment from a soft lockup,
> > can also help people connect a problem they're seeing with the
> > solution.  It's pretty hard to google for "kernel hang," but if you
> > can google for a soft lockup in a specific function, that can be much
> > more useful.
>
> The thing is, not disabling msi interrupts for the case described in the
> bugzilla report is the wrong fix.
>
> The report is about a buggy driver doing the wrong thing.  Until someone
> ships a system that is msi native (aka no intx support) disabling msi
> interrupts at shutdown is the right thing to do.  If there is something
> that handles intx interrupts it is not an msi native system.
>
> The real bug is probably disabling buggy interrupt detection on the
> kernel command line.
>
> Beyond that, to handle kexec cleanly something needs to stop the
> interrupts and stop the DMA transfers.  Which in the short term
> means someone probably needs to write a shutdown method for the buggy
> driver.
>
> An interrupt coming in almost always implies a DMA having completed,
> and if that DMA completed in the wrong spot the kexec'd kernel will be
> toast.
>
> We disable interrupts at boot so that a kernel started with
> kexec-on-panic (which doesn't shut anything down) can boot.  There are
> probably other valid use cases (like native msi interrupts) but I am not
> aware of them.  But according to the pci spec shutting down msi
> interrupts at boot should be a noop.
>
> So in summary not disabling MSI/MSI-X at shutdown is the wrong fix,
> and someone needs to fix a buggy driver.
>
> Eric

I'm not all that worried about this patch making it into stable.
So I suggest for now we ignore the bugzilla and just focus on
the patch itself.

And the patch itself is not about a buggy driver.
It's about a correct driver causing screaming interrupts
because pci core decided to disable msi at shutdown.
Which is not necessary for two reasons:
- because previous patches disable msi when kexec starts now
- because suppressing DMA automatically suppresses MSI as well

-- 
MST