On Fri, Jan 24, 2020 at 6:34 AM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
> Evan,
>
> Evan Green <evgreen@xxxxxxxxxxxx> writes:
> > I did another experiment that I think lends credibility to my torn MSI
> > hypothesis. I have the following change:
> >
> > And indeed, I get a machine check, despite the fact that MSI_DATA is
> > overwritten just after address is updated.
>
> I don't have to understand why a SoC released in 2019 still has
> unmaskable MSI, especially as Inhell's own XHCI spec clearly documents
> and recommends MSI-X.
>
> While your workaround (disabling MSI) works in this particular case, it's
> not really a good option:
>
>  1) Quite some devices have a bug where the legacy INTX disable does not
>     work reliably or is outright broken. That means MSI disable will
>     reroute to INTX.
>
>  2) I dug out old debug data which confirms that some silly devices
>     lose interrupts across MSI disable/reenable if the INTX fallback is
>     disabled.
>
>     And no, it's not a random weird device, it's part of a chipset which
>     was pretty popular a few years ago. I leave it as an exercise for
>     the reader to guess the vendor.
>
> Can you please apply the patch below? It enforces an IPI to the new
> vector/target CPU when the interrupt is MSI w/o masking. It should
> cure the issue. It goes without saying that I'm not proud of it.

I'll feel just as dirty putting a Tested-by on it :)

I don't think this patch is complete, though. As written, it generates
"recovery interrupts" for MSIs that are not maskable, but the
pci_msi_domain_write_msg() path, which is the one I seem to hit, makes
no effort to mask the MSI while changing affinity. So at the very least
it would need a follow-on patch that masks the MSI across the update,
for MSIs that are maskable. __pci_restore_msi_state(), called in the
resume path, does exactly that masking, but for some reason
pci_msi_domain_write_msg() does not (a rough sketch of what I mean
follows the proposal below).

I'm also a bit concerned about all the spurious interrupts we'd be
introducing: not just the retriggering added here, but the fact that we
never dealt with the torn interrupt itself. In my case, XHCI will be
sending an interrupt on the old vector to the new CPU, where that
vector could be registered to anything. I'm worried that not every
driver in the system is hardened against receiving interrupts it's not
prepared for. Perhaps the driver misbehaves, or perhaps it's a "bad"
interrupt like the MCE interrupt that takes the system down. (I realize
the MCE interrupt itself is not in the device vector region, but
imagine some other bad interrupt of that sort.)

Now that you're on board with the torn write theory, what do you think
about my "transit vector" proposal? The idea is this (rough C sketch
after the list):

 - Reserve a single vector number on all CPUs for interrupts in transit
   between CPUs.

 - Interrupts in transit between CPUs are added to some sort of list,
   or maybe hang off the transit vector itself.

 - __pci_write_msi_msg() would, after proper abstractions, essentially
   be doing this:

       pci_write(MSI_DATA, TRANSIT_VECTOR);
       pci_write(MSI_ADDRESS, new_affinity);
       pci_write(MSI_DATA, new_vector);

 - In the rare torn case I've found here, the interrupt will come in on
   <new CPU, transit_vector> or <old CPU, transit_vector>.

 - The ISR for TRANSIT_VECTOR would go through and call the ISR for
   every IRQ in transit across CPUs. This does still result in a couple
   of extra ISR calls, since multiple interrupts might be in transit at
   once, but at least it's very rare.

 - CPU hotplug would keep the same logic it already has, retriggering
   TRANSIT_VECTOR if the interrupt happened to land on
   <old CPU, old vector>.

 - When the interrupt is confirmed on <new CPU, new vector>, remove the
   ISR from the TRANSIT_VECTOR list.
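To make that concrete, here's a rough, uncompiled sketch of what I'm
picturing. TRANSIT_VECTOR, struct transit_irq, transit_list, and the
function names are all invented for illustration, and locking is
omitted; pci_write_config_*(), the PCI_MSI_* offsets, and
generic_handle_irq_desc() are the existing interfaces. This assumes
x86, where the vector is the low 8 bits of MSI_DATA, and a device with
a 32-bit message (a 64-bit one would use PCI_MSI_DATA_64):

#include <linux/irqdesc.h>
#include <linux/list.h>
#include <linux/pci.h>

#define TRANSIT_VECTOR	0xef	/* hypothetical reserved vector */

struct transit_irq {		/* hypothetical bookkeeping entry */
	struct list_head node;
	struct irq_desc *desc;
};

static LIST_HEAD(transit_list);	/* IRQs currently migrating */

static void msi_write_msg_via_transit(struct pci_dev *dev, int pos,
				      struct msi_msg *old_msg,
				      struct msi_msg *new_msg)
{
	/* 1) Park the interrupt on the reserved transit vector,
	 *    still aimed at the old CPU. */
	pci_write_config_word(dev, pos + PCI_MSI_DATA_32,
			      (old_msg->data & ~0xff) | TRANSIT_VECTOR);

	/* 2) Retarget the address. A message torn between 1) and 3)
	 *    arrives as <old CPU, transit> or <new CPU, transit>. */
	pci_write_config_dword(dev, pos + PCI_MSI_ADDRESS_LO,
			       new_msg->address_lo);

	/* 3) Install the real vector on the new CPU. */
	pci_write_config_word(dev, pos + PCI_MSI_DATA_32, new_msg->data);
}

/* TRANSIT_VECTOR's handler can't know which migrating interrupt
 * fired, so it replays every IRQ currently in transit. */
static void transit_vector_handler(void)
{
	struct transit_irq *t;

	list_for_each_entry(t, &transit_list, node)
		generic_handle_irq_desc(t->desc);
}

The invariant is that the device's (address, data) pair always decodes
to a vector somebody owns a handler for, so a torn delivery costs a few
replayed ISRs instead of landing on a random vector.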
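Separately, the follow-on masking fix I mentioned above could look
roughly like this against the current MSI internals. The function name
is invented; irq_data_get_msi_desc(), __pci_msi_desc_mask_irq(), and
__pci_write_msi_msg() are the existing helpers, and the mask dance is
the same one __pci_restore_msi_state() effectively does today:

#include <linux/irq.h>
#include <linux/msi.h>

static void pci_msi_write_msg_masked(struct irq_data *irqd,
				     struct msi_msg *msg)
{
	struct msi_desc *desc = irq_data_get_msi_desc(irqd);
	u32 mask = 1 << desc->msi_attrib.entry_nr; /* per-vector mask bit */

	if (desc->msi_attrib.maskbit) {
		/* Maskable MSI: hide the update behind the mask bit. */
		__pci_msi_desc_mask_irq(desc, mask, mask);
		__pci_write_msi_msg(desc, msg);
		__pci_msi_desc_mask_irq(desc, mask, 0);
	} else {
		/* Unmaskable MSI: still needs the transit trick above. */
		__pci_write_msi_msg(desc, msg);
	}
}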
If you think it's a worthwhile idea I can try to code it up.

I've been running your patch for about 30 minutes with no repro case.

-Evan