Evan! On Mon, Apr 04 2022 at 12:11, Evan Green wrote: > To my surprise, I'm back with another MSI problem, and hoping to get > some advice on how to approach fixing it. Why am I not surprised? > What worries me is those IRQ "no longer affine" messages, as well as > my "EVAN don't touch hw" prints, indicating that requests to change > the MSI are being dropped. These ignored requests are coming in when > we try to migrate all IRQs off of the non-boot CPU, and they get > ignored because all devices are "frozen" at this point, and presumably > not in D0. They are disabled at that point. > To further try and prove that theory, I wrote a script to do the > hibernate prepare image step in a loop, but messed with XHCI's IRQ > affinity beforehand. If I move the IRQ to core 0, so far I have never > seen a hang. But if I move it to another core, I can usually get a > hang in the first attempt. I also very occasionally see wifi splats > when trying this, and those "no longer affine" prints are all the wifi > queue IRQs. So I think a wifi packet coming in at the wrong time can > do the same thing. > > I wanted to see what thoughts you might have on this. Should I try to > make a patch that moves all IRQs to CPU 0 *before* the devices all > freeze? Sounds a little unpleasant. Or should PCI be doing something > different to avoid this combination of "you're not allowed to modify > my MSIs, but I might still generate interrupts that must not be lost"? PCI cannot do much here and moving interrupts around is papering over the underlying problem. xhci_hcd 0000:00:0d.0: EVAN Write MSI 0 fee1e000 4023 This sets up the interrupt when the driver is loaded xhci_hcd 0000:00:14.0: EVAN Write MSI 0 fee01000 4024 Ditto xhci_hcd 0000:00:0d.0: calling pci_pm_freeze+0x0/0xad @ 423, parent: pci0000:00 xhci_hcd 0000:00:14.0: calling pci_pm_freeze+0x0/0xad @ 4644, parent: pci0000:00 xhci_hcd 0000:00:14.0: pci_pm_freeze+0x0/0xad returned 0 after 0 usecs xhci_hcd 0000:00:0d.0: EVAN Write MSI 0 fee1e000 4023 xhci_hcd 0000:00:0d.0: pci_pm_freeze+0x0/0xad returned 0 after 196000 usecs Those freeze() calls end up in xhci_suspend(), which tears down the XHCI and ensures that no interrupts are on flight. xhci_hcd 0000:00:0d.0: calling pci_pm_freeze_noirq+0x0/0xb2 @ 4645, parent: pci0000:00 xhci_hcd 0000:00:0d.0: pci_pm_freeze_noirq+0x0/0xb2 returned 0 after 30 usecs xhci_hcd 0000:00:14.0: calling pci_pm_freeze_noirq+0x0/0xb2 @ 4644, parent: pci0000:00 xhci_hcd 0000:00:14.0: pci_pm_freeze_noirq+0x0/0xb2 returned 0 after 3118 usecs Now the devices are disabled and not accessible xhci_hcd 0000:00:14.0: EVAN Don't touch hw 0 fee00000 4024 xhci_hcd 0000:00:0d.0: EVAN Don't touch hw 0 fee1e000 4045 xhci_hcd 0000:00:0d.0: EVAN Don't touch hw 0 fee00000 4045 xhci_hcd 0000:00:14.0: calling pci_pm_thaw_noirq+0x0/0x70 @ 9, parent: pci0000:00 xhci_hcd 0000:00:14.0: EVAN Write MSI 0 fee00000 4024 This is the early restore _before_ the XHCI resume code is called This interrupt is targeted at CPU0 (it's the one which could not be written above). xhci_hcd 0000:00:14.0: pci_pm_thaw_noirq+0x0/0x70 returned 0 after 5272 usecs xhci_hcd 0000:00:0d.0: calling pci_pm_thaw_noirq+0x0/0x70 @ 1123, parent: pci0000:00 xhci_hcd 0000:00:0d.0: EVAN Write MSI 0 fee00000 4045 Ditto xhci_hcd 0000:00:0d.0: pci_pm_thaw_noirq+0x0/0x70 returned 0 after 623 usecs xhci_hcd 0000:00:14.0: calling pci_pm_thaw+0x0/0x7c @ 3856, parent: pci0000:00 xhci_hcd 0000:00:14.0: pci_pm_thaw+0x0/0x7c returned 0 after 0 usecs xhci_hcd 0000:00:0d.0: calling pci_pm_thaw+0x0/0x7c @ 4664, parent: pci0000:00 xhci_hcd 0000:00:0d.0: pci_pm_thaw+0x0/0x7c returned 0 after 0 usecs That means the suspend/resume logic is doing the right thing. How the XHCI ends up being confused here is a mystery. Cc'ed a few more folks. Thanks, tglx