On Thu, Mar 09, 2023 at 01:04:17PM +0530, Basavaraj Natikar wrote: > On 3/9/2023 4:34 AM, Limonciello, Mario wrote: > >> -----Original Message----- > >> From: Bjorn Helgaas <helgaas@xxxxxxxxxx> > >> Sent: Wednesday, March 8, 2023 16:44 > >> To: Natikar, Basavaraj <Basavaraj.Natikar@xxxxxxx> > >> Cc: bhelgaas@xxxxxxxxxx; linux-pci@xxxxxxxxxxxxxxx; Limonciello, Mario > >> <Mario.Limonciello@xxxxxxx>; thomas@xxxxxxxxxxxx > >> Subject: Re: [PATCH] PCI: Add quirk to clear MSI-X > >> > >> Let's mention the vendor and device name in the subject to make the > >> log more useful. > > Sure will change subject as below. > Add quirk on AMD 0x15b8 device to clear MSI-X enable bit "0x15b8" is not really useful in a subject line. Use a name meaningful to users, like something "lspci" reports (I don't see "1002:15b8" in https://pci-ids.ucw.cz/read/PC/1002; it would be nice to add it) or at least something like "USB controller". You can look at the history of drivers/pci/quirks.c to see examples. > >> On Mon, Mar 06, 2023 at 12:53:40PM +0530, Basavaraj Natikar wrote: > >>> One of the AMD USB controllers fails to maintain internal functional > >>> context when transitioning from D3 to D0, desynchronizing MSI-X bits. > >>> As a result, add a quirk to this controller to clear the MSI-X bits > >>> on suspend. > > ... > > FYI - it's not a hardware defect, it's a BIOS defect. Commit log ("controller fails to maintain") suggested hardware defect to me; maybe could be clarified. If it's a defect in the way BIOS initialized something, maybe the workaround could be a one-time thing instead of an every-resume quirk? > >> The quick clears the Function Mask bit, so the MSI-X vectors may be > >> *unmasked* depending on the state of each vectors Mask bit. I assume > >> the potential unmasking is safe because you also clear the MSI-X > >> Enable bit, so the function can't use MSI-X at all. > > Sure, will remove Function Mask bit only clear MSI-X enable bit is enough, > actually MSI-X enable bit doesn't change the internal hardware and there > will be no interrupts after resume hence below command timeout and eventually > error observed more logs below: > > [ 418.572737] xhci_hcd 0000:03:00.0: xhci_hub_status_data: stopping usb5 port polling > *[ 423.724511] xhci_hcd 0000:03:00.0: Command timeout, USBSTS: 0x00000000****[ 423.724517] xhci_hcd 0000:03:00.0: Command timeout* > [ 423.724519] xhci_hcd 0000:03:00.0: Abort command ring > [ 425.740742] xhci_hcd 0000:03:00.0: No stop event for abort, ring start fail? > *[ 425.740771] xhci_hcd 0000:03:00.0: Error while assigning device slot ID****[ 425.740777] xhci_hcd 0000:03:00.0: Max number of devices this xHCI > host supports is 64*. > [ 425.740782] usb usb5-port1: couldn't allocate usb_device > [ 425.740794] xhci_hcd 0000:03:00.0: disable port 5-1, portsc: 0x6e1 > [ 425.740818] hub 5-0:1.0: hub_suspend > [ 425.740826] usb usb5: bus auto-suspend, wakeup 1 > [ 425.740835] xhci_hcd 0000:03:00.0: xhci_hub_status_data: stopping usb5 port polling > [ 425.740842] xhci_hcd 0000:03:00.0: xhci_suspend: stopping usb5 port polling. > [ 425.756878] xhci_hcd 0000:03:00.0: // Setting command ring address to 0xffffe001 > [ 425.776898] xhci_hcd 0000:03:00.0: WARN: xHC save state timeout > [ 425.776910] xhci_hcd 0000:03:00.0: PM: suspend_common(): xhci_pci_suspend+0x0/0x170 [xhci_pci] returns -110 > [ 425.776917] xhci_hcd 0000:03:00.0: hcd_pci_runtime_suspend: -110 > [ 425.776918] xhci_hcd 0000:03:00.0: can't suspend (hcd_pci_runtime_suspend returned -110) > > will change function name accordingly quirk_clear_msix_en > and with only ctrl &= ~PCI_MSIX_FLAGS_ENABLE; > > >> All state is lost in D3cold, so I guess this problem must occur during > >> a D3hot to D0 transition, right? I assume this device sets > >> No_Soft_Reset, so the function is supposed to return to D0active with > >> all internal state intact. But this device returns to D0active with > >> the MSI-X internal state corrupted? > >> > >> I assume this relies on pci_restore_state() to restore the MSI-X > >> state. Seems like that might be enough to restore the internal state > >> even without this quirk, but I guess it must not be. > > > > The important part is the register value changing to make > > the internal hardware move. Because it restores identically it doesn't change > > the internal hardware. > > Yes correct, even though pci_restore_state restores all pci registers states > including MSI-X bits __pci_restore_msix_state after resume but internal AMD > controller's MSI_X enable bit is out of sync and AMD controller fails to maintain > internal MSI-X enable bits. So the register value *change* is important, and you force a different value by writing something different at suspend-time so the value at restore-time will be different. That's a little obscure since those points are far separated. Also it changes the behavior (masking MSI-X at suspend-time), which complicates the analysis since we have to verify that we don't need MSI-X after the quirk runs. And the current quirk relies on the fact that PCI_MSIX_FLAGS_ENABLE is set, which again complicates the analysis (I guess if MSI-X is *not* enabled, you might not need the quirk at all?) Is there any way you could do the quirk at resume-time, e.g., if MSI-X is supposed to be enabled, first disable it and immediately re-enable it? > >>> Note: This quirk works in all scenarios, regardless of whether the > >>> integrated GPU is disabled in the BIOS. > >> > >> I don't know how the integrated GPU is related to this USB controller, > >> but I assume this fact is important somehow? > > > > This bug is due to a BIOS bug with the initialization. We also posted in > > parallel a different workaround that fixes the initialization to match what > > the BIOS should have set via the GPU driver. > > > > It should be going in for 6.3-rc2. > > https://gitlab.freedesktop.org/agd5f/linux/-/commit/07494a25fc8881e122c242a46b5c53e0e4403139 That nbio_v7.2.c patch and this patch don't look anything alike. It looks like the nbio_v7.2.c patch might run once? Could *this* be done once at enumeration-time, too? Bjorn