Re: [PATCH] PCI: Add quirk to clear MSI-X

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 3/9/23 16:30, Bjorn Helgaas wrote:
On Thu, Mar 09, 2023 at 12:32:41PM -0600, Limonciello, Mario wrote:
On 3/9/2023 12:25, Bjorn Helgaas wrote:
...

https://gitlab.freedesktop.org/agd5f/linux/-/commit/07494a25fc8881e122c242a46b5c53e0e4403139

That nbio_v7.2.c patch and this patch don't look anything alike.  It
looks like the nbio_v7.2.c patch might run once?  Could *this* be done
once at enumeration-time, too?

They don't look anything alike because they're attacking the problem from
different angles.

Why do we need different angles?

The GPU driver approach only works if the GPU is enabled. If the GPU could never be disabled then it alone would be sufficient.


The NBIO patch fixes the initialization value for the internal registers.
This is what the BIOS "should" have done.  When the internal registers are
configured properly then the behavior the kernel expects works as well.

The NBIO patch will run both at amdgpu startup as well as when resuming from
suspend.

If initializing something as BIOS should have done makes the hardware
work correctly, isn't once enough?  Why does the NBIO patch need to
run at resume-time?

During suspend some internal registers are in a power domain that the state will be lost. These are typically restored by the BIOS to the values defined in initialization tables before handing control back to the OS.



This patch we're discussing treats the symptoms of the deficiency and avoids
the impact.
This patch runs any time the controller is runtime resumed.  So, yes it will
run more frequently.  Because this patch is treating the symptoms it needs
to be applied every single time the controller exits D3.

This patch runs at *suspend*-time (DECLARE_PCI_FIXUP_SUSPEND), not
resume-time.

The difference is important because with this broken BIOS, MSI-X is
disabled between the suspend quirk and some distant point in resume.
With non-broken BIOS, MSI-X remains *enabled* for at least part of
that period, and I don't want to have to figure out whether that
difference is important.

I'll let Basavaraj comment on the timing here with the behavior workaround and sequence of events.


We have fragments of a coherent commit log, but it's not quite a
complete story yet.  I think so far we have:

   - Issue affects only the 1022:15b8 USB controller (well, I guess it
     also affects some GPU device?)

Same device.  It's just a way to access the internal registers.

   - Only a problem when BIOS doesn't initialize controller correctly
   - Controller claims to preserve internal state on D3hot->D0
     transition, but it doesn't
   - D0->D3hot->D0 transitions do preserve external PCI_MSIX_FLAGS
     state; only internal state is lost
   - When MSI-X is enabled and controller transitions D0->D3hot->D0,
     MSI-X appears enabled per PCI_MSIX_FLAGS, but is actually
     *disabled* because the internal state was lost
   - MSI-X being disabled leads to xhci_hcd command timeouts because
     interrupts are missed
   - Not possible for an enumeration-time quirk to fix the controller
     initialization problem (why not?)
   - Writing PCI_MSIX_FLAGS with a *different* value fixes the internal
     state; writing the same value does nothing
   - A suspend- or resume-time quirk can work around this, and this is
     safe on *all* 1022:15b8 devices regardless of whether the BIOS is
     broken
   - The same approach can't be used for both 1022:15b8 and the GPU
     device because ...?

Bjorn




[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux