On Mon, Nov 16, 2020 at 10:07 PM Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
> [...]
> > I think we need to disable MSIs in the crashing kernel before the
> > kexec. It adds a little more code in the crash_kexec() path, but it
> > seems like a worthwhile tradeoff.
>
> Disabling MSIs in the b0rken kernel is not possible.
>
> Walking the device tree or even a significant subset of it hugely
> decreases the chances that we will run into something that is incorrect
> in the known broken kernel. I expect the code to do that would double
> or triple the amount of code that must be executed in the known broken
> kernel. The last time something like that happened (switching from xchg
> to ordinary locks) we had cases that stopped working. Walking all of
> the pci devices in the system is much more invasive.
>

I think we could try to split this problem in two, if that makes sense.
Bjorn/others can jump in and correct me if I'm wrong.

So, the problem is to walk the PCI topology tree, identify every device
and disable MSI (and maybe INTx) in each of them. The big deal with doing
that in the broken kernel before the kexec is that the task is complex,
mainly because of the tree walk.

But what if we keep a table (a simple 2-D array) with the address and
data to be written to disable the MSIs, and have a parameter that enables
a function which, before the kexec, just goes through this array and
performs the writes?

The table itself would be constructed by the PCI core (and updated in
the hotplug path) while device discovery is happening. It would live in
a protected area of memory, with no write/execute access, so that if the
kernel itself tries to corrupt it, we get a fault (yeah, I know DMAs can
corrupt anything, but the IOMMU could protect against that).

If the parameter "kdump_clear_msi" is passed in the cmdline of the
regular kernel, for example, then the function walks this simple table
and performs the devices' writes before the kexec...

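To make the table idea above a bit more concrete, here is a minimal
userspace sketch of it. It is not kernel code and not a proposed patch:
the names (msi_quiesce_entry, quiesce_table_add, quiesce_table_flush) are
hypothetical, the "registers" are plain variables standing in for each
device's MSI Message Control register, and a real implementation would
have to go through the kernel's config-space accessors. The only hardware
fact it relies on is that MSI Enable is bit 0 of Message Control.

```c
#include <stddef.h>
#include <stdint.h>

/* MSI Enable is bit 0 of the MSI Message Control register. */
#define MSI_FLAGS_ENABLE 0x0001u

/* Hypothetical fixed capacity; a real table would be sized at discovery. */
#define MAX_ENTRIES 8

/* One precomputed write: where to write and what to write there. */
struct msi_quiesce_entry {
    volatile uint16_t *reg;  /* stand-in for a device's Message Control */
    uint16_t value;          /* register value with MSI Enable cleared */
};

static struct msi_quiesce_entry quiesce_table[MAX_ENTRIES];
static size_t quiesce_count;

/*
 * Healthy-kernel side (discovery/hotplug): record one device.
 * The value is a snapshot taken now, so the later write also resets
 * any other Message Control bits the driver changed in the meantime.
 */
int quiesce_table_add(volatile uint16_t *msg_ctrl)
{
    if (quiesce_count >= MAX_ENTRIES)
        return -1;
    quiesce_table[quiesce_count].reg = msg_ctrl;
    quiesce_table[quiesce_count].value =
        (uint16_t)(*msg_ctrl & (uint16_t)~MSI_FLAGS_ENABLE);
    quiesce_count++;
    return 0;
}

/*
 * Crash side: a flat loop over the array. No tree walk, no locking,
 * no allocation, which is the whole point of building the table early.
 */
void quiesce_table_flush(void)
{
    for (size_t i = 0; i < quiesce_count; i++)
        *quiesce_table[i].reg = quiesce_table[i].value;
}
```

In this sketch the crash path would call quiesce_table_flush() once,
gated by the "kdump_clear_msi" parameter, right before jumping to the
kdump kernel. The write protection of the table is not modeled here.
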
If that's not possible or is a bad idea for any reason, I still think
the early_quirks() idea proposed here is not something we should discard,
because it'll help a lot of users even with its limitations (in our case
it worked very well).

Also, taking the opportunity to clarify my understanding of the
limitations of that approach: Bjorn, in our reproducer machine we had 3
parents in the PCI tree (as per lspci -t), 0000:00, 0000:ff and 0000:80.
Are those all under "segment 0" in your terminology? In our case the
troublemaker was under 0000:80, and early_quirks() shut the device up
successfully.

Thanks,

Guilherme