On Thu, Mar 09, 2023 at 12:32:41PM -0600, Limonciello, Mario wrote: > On 3/9/2023 12:25, Bjorn Helgaas wrote: > ... > > > > https://gitlab.freedesktop.org/agd5f/linux/-/commit/07494a25fc8881e122c242a46b5c53e0e4403139 > > > > That nbio_v7.2.c patch and this patch don't look anything alike. It > > looks like the nbio_v7.2.c patch might run once? Could *this* be done > > once at enumeration-time, too? > > They don't look anything alike because they're attacking the problem from > different angles. Why do we need different angles? > The NBIO patch fixes the initialization value for the internal registers. > This is what the BIOS "should" have done. When the internal registers are > configured properly then the behavior the kernel expects works as well. > > The NBIO patch will run both at amdgpu startup as well as when resuming from > suspend. If initializing something as BIOS should have done makes the hardware work correctly, isn't once enough? Why does the NBIO patch need to run at resume-time? > This patch we're discussing treats the symptoms of the deficiency and avoids > the impact. > This patch runs any time the controller is runtime resumed. So, yes it will > run more frequently. Because this patch is treating the symptoms it needs > to be applied every single time the controller exits D3. This patch runs at *suspend*-time (DECLARE_PCI_FIXUP_SUSPEND), not resume-time. The difference is important because with this broken BIOS, MSI-X is disabled between the suspend quirk and some distant point in resume. With non-broken BIOS, MSI-X remains *enabled* for at least part of that period, and I don't want to have to figure out whether that difference is important. We have fragments of a coherent commit log, but it's not quite a complete story yet. I think so far we have: - Issue affects only the 1022:15b8 USB controller (well, I guess it also affects some GPU device?) - Only a problem when BIOS doesn't initialize controller correctly - Controller claims to preserve internal state on D3hot->D0 transition, but it doesn't - D0->D3hot->D0 transitions do preserve external PCI_MSIX_FLAGS state; only internal state is lost - When MSI-X is enabled and controller transitions D0->D3hot->D0, MSI-X appears enabled per PCI_MSIX_FLAGS, but is actually *disabled* because the internal state was lost - MSI-X being disabled leads to xhci_hcd command timeouts because interrupts are missed - Not possible for an enumeration-time quirk to fix the controller initialization problem (why not?) - Writing PCI_MSIX_FLAGS with a *different* value fixes the internal state; writing the same value does nothing - A suspend- or resume-time quirk can work around this, and this is safe on *all* 1022:15b8 devices regardless of whether the BIOS is broken - The same approach can't be used for both 1022:15b8 and the GPU device because ...? Bjorn