Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

Guilherme Piccoli <gpiccoli@xxxxxxxxxxxxx> · Tue, 17 Nov 2020 09:04:07 -0300

On Mon, Nov 16, 2020 at 10:07 PM Eric W. Biederman
<ebiederm@xxxxxxxxxxxx> wrote:
> [...]
> > I think we need to disable MSIs in the crashing kernel before the
> > kexec.  It adds a little more code in the crash_kexec() path, but it
> > seems like a worthwhile tradeoff.
>
> Disabling MSIs in the b0rken kernel is not possible.
>
> Walking the device tree or even a significant subset of it hugely
> decreases the chances that we will run into something that is incorrect
> in the known broken kernel.  I expect the code to do that would double
> or triple the amount of code that must be executed in the known broken
> kernel.  The last time something like that happened (switching from xchg
> to ordinary locks) we had cases that stopped working.  Walking all of
> the pci devices in the system is much more invasive.
>

I think we could try to decouple this problem in 2, if that makes
sense. Bjorn/others can jump in and correct me if I'm wrong. So, the
problem is to walk a PCI topology tree, identify every device and
disable MSI(/INTx maybe) in them, and the big deal with doing that in
the broken kernel before the kexec is that this task is complex due to
the tree walk, mainly. But what if we keep a table (a simple 2-D
array) with the address and data to be written to disable the MSIs,
and before the kexec we could have a parameter enabling a function
that just goes through this array and performs the writes?

The table itself would be constructed by the PCI core (and updated in
the hotplug path), when device discovery is happening. This table
would live in a protected area in memory, with no write/execute
access, this way if the kernel itself tries to corrupt that, we get a
fault (yeah, I know DMAs can corrupt anywhere, but IOMMU could protect
against that). If the parameter "kdump_clear_msi" is passed in the
cmdline of the regular kernel, for example, then the function walks
this simple table and performs the devices' writes before the kexec...

If that's not possible or a bad idea for any reason, I still think the
early_quirks() idea hereby proposed is not something we should
discard, because it'll help a lot of users even with its limitations
(in our case it worked very well).
Also, taking here the opportunity to clarify my understanding about
the limitations of that approach: Bjorn, in our reproducer machine we
had 3 parents in the PCI tree (as per lspci -t), 0000:00, 0000:ff and
0000:80 - are those all under "segment 0" as per your verbiage? In our
case the troublemaker was under 0000:80, and the early_quirks() shut
the device up successfully.

Thanks,

Guilherme