Lost MSIs during hibernate

Evan Green <evgreen@xxxxxxxxxx> · Mon, 4 Apr 2022 12:11:39 -0700

Hi Thomas et al,
To my surprise, I'm back with another MSI problem, and hoping to get
some advice on how to approach fixing it.

Summary: I think MSIs are lost across the hibernate freeze/thaw
transition on the way down because __pci_write_msi_msg() drops the
write if the device is not in D0.

Details:
I've been playing with hibernation on an Alderlake device, and have
been running into problems where the freeze/thaw transition that
should generate the hibernate image ends up hanging (eg before we even
start writing the image out to disk). When it hangs I find it in
usb_kill_urb(), an error path that comes out of a failed attempt to
send a control packet to a hub port coming from usb_resume().
Occasionally, I see the classic "HC died; cleaning up" message
instead. XHCI in general appears to be very sensitive to lost MSIs, so
I started down that road.

I printed the three major paths through __pci_write_msi_msg() so I
could see what the XHCI controller was ending up with when it hung.
You can see a full boot and hibernate attempt sequence that results in
a hang here (sorry there's other cruft in there):

https://pastebin.com/PFd3x1k0

What worries me is those IRQ "no longer affine" messages, as well as
my "EVAN don't touch hw" prints, indicating that requests to change
the MSI are being dropped. These ignored requests are coming in when
we try to migrate all IRQs off of the non-boot CPU, and they get
ignored because all devices are "frozen" at this point, and presumably
not in D0.

So my theory is XHCI for whatever reason boots affinitized to a
non-boot CPU. We go through pci_pm_freeze(), then try to take the
non-boot CPUs down. The request to move the MSI off of the dead CPU is
ignored, and then XHCI generates an interrupt during the period while
that non-boot CPU is dead.

To further try and prove that theory, I wrote a script to do the
hibernate prepare image step in a loop, but messed with XHCI's IRQ
affinity beforehand. If I move the IRQ to core 0, so far I have never
seen a hang. But if I move it to another core, I can usually get a
hang in the first attempt. I also very occasionally see wifi splats
when trying this, and those "no longer affine" prints are all the wifi
queue IRQs. So I think a wifi packet coming in at the wrong time can
do the same thing.

I wanted to see what thoughts you might have on this. Should I try to
make a patch that moves all IRQs to CPU 0 *before* the devices all
freeze? Sounds a little unpleasant. Or should PCI be doing something
different to avoid this combination of "you're not allowed to modify
my MSIs, but I might still generate interrupts that must not be lost"?

-Evan