On 3/10/2023 6:27 AM, Mario Limonciello wrote: > On 3/9/23 16:30, Bjorn Helgaas wrote: >> On Thu, Mar 09, 2023 at 12:32:41PM -0600, Limonciello, Mario wrote: >>> On 3/9/2023 12:25, Bjorn Helgaas wrote: >>> ... >> >>>>>> https://gitlab.freedesktop.org/agd5f/linux/-/commit/07494a25fc8881e122c242a46b5c53e0e4403139 >>>>>> >>>> >>>> That nbio_v7.2.c patch and this patch don't look anything alike. It >>>> looks like the nbio_v7.2.c patch might run once? Could *this* be done >>>> once at enumeration-time, too? >>> >>> They don't look anything alike because they're attacking the problem >>> from >>> different angles. >> >> Why do we need different angles? > > The GPU driver approach only works if the GPU is enabled. If the GPU > could never be disabled then it alone would be sufficient. > >> >>> The NBIO patch fixes the initialization value for the internal >>> registers. >>> This is what the BIOS "should" have done. When the internal >>> registers are >>> configured properly then the behavior the kernel expects works as well. >>> >>> The NBIO patch will run both at amdgpu startup as well as when >>> resuming from >>> suspend. >> >> If initializing something as BIOS should have done makes the hardware >> work correctly, isn't once enough? Why does the NBIO patch need to >> run at resume-time? > > During suspend some internal registers are in a power domain that the > state will be lost. These are typically restored by the BIOS to the > values defined in initialization tables before handing control back to > the OS. > > >> >>> This patch we're discussing treats the symptoms of the deficiency >>> and avoids >>> the impact. >>> This patch runs any time the controller is runtime resumed. So, yes >>> it will >>> run more frequently. Because this patch is treating the symptoms it >>> needs >>> to be applied every single time the controller exits D3. >> >> This patch runs at *suspend*-time (DECLARE_PCI_FIXUP_SUSPEND), not >> resume-time. >> >> The difference is important because with this broken BIOS, MSI-X is >> disabled between the suspend quirk and some distant point in resume. >> With non-broken BIOS, MSI-X remains *enabled* for at least part of >> that period, and I don't want to have to figure out whether that >> difference is important. > > I'll let Basavaraj comment on the timing here with the behavior > workaround and sequence of events. As replied in the previous mail, Bjron's suggestion works well and holds good so I will change the quirk to apply in resume instead of suspend which also resolves the issue as below i.e. restoring during resume if MSI-X is enabled works. static void quirk_restore_msix_en(struct pci_dev *dev) { u16 ctrl; pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &ctrl); if (!(ctrl & PCI_MSIX_FLAGS_ENABLE)) return; ctrl &= ~PCI_MSIX_FLAGS_ENABLE; pci_write_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, ctrl); ctrl |= PCI_MSIX_FLAGS_ENABLE; pci_write_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, ctrl); } DECLARE_PCI_FIXUP_RESUME_EARLY(PCI_VENDOR_ID_AMD, 0x15b8, quirk_restore_msix_en); Will change the commit message accordingly. I guess for the questions below we already answered. Please let us know if you need more clarifications. Thanks, -- Basavaraj > >> >> We have fragments of a coherent commit log, but it's not quite a >> complete story yet. I think so far we have: >> >> - Issue affects only the 1022:15b8 USB controller (well, I guess it >> also affects some GPU device?) > > Same device. It's just a way to access the internal registers. > >> - Only a problem when BIOS doesn't initialize controller correctly >> - Controller claims to preserve internal state on D3hot->D0 >> transition, but it doesn't >> - D0->D3hot->D0 transitions do preserve external PCI_MSIX_FLAGS >> state; only internal state is lost >> - When MSI-X is enabled and controller transitions D0->D3hot->D0, >> MSI-X appears enabled per PCI_MSIX_FLAGS, but is actually >> *disabled* because the internal state was lost >> - MSI-X being disabled leads to xhci_hcd command timeouts because >> interrupts are missed >> - Not possible for an enumeration-time quirk to fix the controller >> initialization problem (why not?) >> - Writing PCI_MSIX_FLAGS with a *different* value fixes the internal >> state; writing the same value does nothing >> - A suspend- or resume-time quirk can work around this, and this is >> safe on *all* 1022:15b8 devices regardless of whether the BIOS is >> broken >> - The same approach can't be used for both 1022:15b8 and the GPU >> device because ...? >> >> Bjorn >