First of all, thanks everybody for the great insights and discussion! This thread ended up being a great learning experience for me, at least. Given the large number of replies and intermixed comments, I won't be able to respond inline to everything, so I'll try another approach below.

From Bjorn:
"I think [0] proposes using early_quirks() to disable MSIs at boot-time. That doesn't seem like a robust solution because (a) the problem affects all arches but early_quirks() is x86-specific and (b) even on x86 early_quirks() only works for PCI segment 0 because it relies on the 0xCF8/0xCFC I/O ports."

Ah, I wasn't aware of that limitation - I thought enhancing the early_quirks() search to go through all buses would fix it. Thanks for the clarification! And again, it's worth clarifying that this is not a problem affecting all arches _in practice_ - PowerPC, for example, has firmware primitives allowing a powerful (out-of-band) PCI controller reset, which usually prevents this kind of issue.

[0] https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@xxxxxxxxxxxxx

From Bjorn:
"A crash_device_shutdown() could do something at the host bridge level if that's possible, or reset/disable bus mastering/disable MSI/etc on individual PCI devices if necessary."

From Lukas:
"Guilherme's original patches from 2018 iterate over all 256 PCI buses. That might impact boot time negatively. The reason he has to do that is because the crash kernel doesn't know which devices exist and which have interrupts enabled. However the crashing kernel has that information. It should either disable interrupts itself or pass the necessary information to the crash kernel as setup_data or whatever. Guilherme's patches add a "clearmsi" command line parameter. I guess the idea is that the parameter is always passed to the crash kernel but the patches don't do that, which seems odd."

Very good points, Lukas, thanks! The reason for not making the clearmsi mechanism a default kdump procedure is related to your first point: it impacts boot time a bit, and it adds extra logic to the kdump kernel, which we know is (sometimes) the last resort for gathering data to debug a potentially complex issue. That said, I wouldn't like to enforce this kind of "policy" on everybody, so my implementation keeps it as a parameter, and the kdump userspace counterpart can then choose whether or not to add such a mechanism to the kdump kernel parameter list.

About passing the data to the next kernel: this is very interesting. We could do it either through setup_data (as you said) or through a new proposal, the PKRAM thing [a]. This way we wouldn't need a crash_device_shutdown(); instead, when the kernel loads a crash kernel (through kexec -p), we could collect all the necessary information to pass to the crash kernel (obviously, if we are collecting PCI topology information, we need callbacks in the PCI hotplug add/remove path to keep it up to date).

[a] https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@xxxxxxxxxx/
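To make the setup_data idea a bit more concrete, something like the payload below is what I'm picturing - to be clear, the type value and both structs are entirely made up for illustration, nothing like this exists today:

/*
 * Hypothetical setup_data payload: at kexec -p load time (and on PCI
 * hotplug add/remove), the running kernel records every function that
 * currently has MSI or MSI-X enabled, so the crash kernel can mask
 * exactly those before enabling interrupts.
 *
 * struct setup_data itself is the real x86 boot protocol header from
 * arch/x86/include/uapi/asm/bootparam.h; everything else here is made up.
 */
#define SETUP_PCI_MSI_DEVS	0x1000		/* made-up setup_data type */

struct pci_msi_dev {
	__u16 segment;				/* PCI domain number */
	__u8  bus;
	__u8  devfn;
};

struct pci_msi_setup_data {
	struct setup_data header;		/* header.type = SETUP_PCI_MSI_DEVS */
	__u32 num_devs;
	struct pci_msi_dev devs[];
};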
Below, an inline reply to Eric's last message.

On 15/11/2020 17:46, Eric W. Biederman wrote:
>> [...]
>> An MSI interrupt is a (DMA) write to the local APIC base address
>> 0xfeexxxxx which has the target CPU and control bits encoded in the
>> lower bits. The written data is the vector and more control bits.
>>
>> The only way to stop that is to shut it up at the PCI device level.
>
> Or to write to magic chipset registers that will stop transforming DMA
> writes to 0xfeexxxxx into x86 interrupts. With an IOMMU I know x86 has
> such registers (because the point of the IOMMU is to limit the problems
> rogue devices can cause). Without an IOMMU I don't know if x86 has any
> such registers. I remember that other platforms have an interrupt
> controller that does provide some level of control. That x86 does not
> is what makes this an x86 specific problem.
> [...]
> Looking at patch 3/3 what this patchset does is an early disable of
> the msi registers. Which is mostly reasonable. Especially as has
> been pointed out the only location the x86 vector and x86 cpu can
> be found is in the msi configuration registers.
>
> That also seems reasonable. But Bjorn's concern about not finding all
> devices in all domains does seem real.
> [...]
> So if we can safely and reliably disable DMA and MSI at the generic PCI
> device level during boot up I am all for it.
>
> How difficult would it be to tell the IOMMUs to stop passing traffic
> through in an early pci quirk? The problem setup was apparently someone
> using the device directly from a VM. So I presume there is an IOMMU
> in that configuration.

This is a good idea, I think - we considered something like it in theory while working the problem back in 2018, but since I'm even less of an expert in IOMMU than I am in PCI, the choice was to go with the PCI approach. And you are right, the original problem was a device in PCI passthrough to a VM, with a tool on the host side messing with the PF of the SR-IOV device (to collect some stats), through VFIO IIRC.

Anyway, after all the considerations in the thread, I believe we could split the problem in two:

(1) If possible, set up the IOMMU to prevent MSIs by "blocking" DMA writes from PCI devices *until* the PCI core properly initializes them. That would handle the majority of cases, I guess (IOMMU usage is quite common nowadays).

(2) Collect PCI topology information in the running kernel and pass it to the kdump kernel, which would allow us to disable the PCI devices' MSI capabilities there. The only problem is that I couldn't see how to do that early enough - before the 'sti' executes - without relying on early_quirks(). ACPI, maybe? As per Bjorn's comment explaining the limitations of early_quirks(), this problem doesn't seem trivial. I'm a bit against doing it in crash_kexec(), for the reasons mentioned in the thread (especially by Thomas and Eric) - but if there's no other way... maybe that's the way to go. (A rough sketch of the MSI-disable loop itself is appended after the P.S. below.)

Cheers,

Guilherme

P.S. Fixed Gavin's bouncing address, sorry for that.
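P.P.S. For concreteness, this is roughly the kind of early MSI-disable loop the thread has been discussing (more or less what the 2018 clearmsi patches do) - a rough, untested sketch rather than the actual patch, and the function name is made up. It uses the x86 early config accessors from <asm/pci-direct.h>, so it has exactly the segment-0 limitation Bjorn pointed out, and it skips the multi-function and capability-list sanity checks:

static void __init early_clear_msi(void)
{
	int bus, slot, func;

	for (bus = 0; bus < 256; bus++) {
		for (slot = 0; slot < 32; slot++) {
			for (func = 0; func < 8; func++) {
				u8 pos;

				/* No device at this address */
				if (read_pci_config(bus, slot, func,
						    PCI_VENDOR_ID) == 0xffffffff)
					continue;

				/*
				 * Walk the capability list, clearing the
				 * MSI and MSI-X enable bits in the message
				 * control registers (offset 2 of each cap).
				 */
				pos = read_pci_config_byte(bus, slot, func,
						PCI_CAPABILITY_LIST) & ~3;
				while (pos) {
					u8 cap = read_pci_config_byte(bus, slot,
								      func, pos);
					u16 ctrl = read_pci_config_16(bus, slot,
								      func, pos + 2);

					if (cap == PCI_CAP_ID_MSI)
						write_pci_config_16(bus, slot, func,
							pos + 2,
							ctrl & ~PCI_MSI_FLAGS_ENABLE);
					else if (cap == PCI_CAP_ID_MSIX)
						write_pci_config_16(bus, slot, func,
							pos + 2,
							ctrl & ~PCI_MSIX_FLAGS_ENABLE);

					pos = read_pci_config_byte(bus, slot,
							func, pos + 1) & ~3;
				}
			}
		}
	}
}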