On 3/9/2023 4:34 AM, Limonciello, Mario wrote: > [Public] > > > >> -----Original Message----- >> From: Bjorn Helgaas <helgaas@xxxxxxxxxx> >> Sent: Wednesday, March 8, 2023 16:44 >> To: Natikar, Basavaraj <Basavaraj.Natikar@xxxxxxx> >> Cc: bhelgaas@xxxxxxxxxx; linux-pci@xxxxxxxxxxxxxxx; Limonciello, Mario >> <Mario.Limonciello@xxxxxxx>; thomas@xxxxxxxxxxxx >> Subject: Re: [PATCH] PCI: Add quirk to clear MSI-X >> >> Let's mention the vendor and device name in the subject to make the >> log more useful. Sure will change subject as below. Add quirk on AMD 0x15b8 device to clear MSI-X enable bit >> >> On Mon, Mar 06, 2023 at 12:53:40PM +0530, Basavaraj Natikar wrote: >>> One of the AMD USB controllers fails to maintain internal functional >>> context when transitioning from D3 to D0, desynchronizing MSI-X bits. >>> As a result, add a quirk to this controller to clear the MSI-X bits >>> on suspend. >> Is this a documented erratum? Please include a citation if so. >> >> Are there any other AMD USB devices with the same defect? > FYI - it's not a hardware defect, it's a BIOS defect. > >> The quick clears the Function Mask bit, so the MSI-X vectors may be >> *unmasked* depending on the state of each vectors Mask bit. I assume >> the potential unmasking is safe because you also clear the MSI-X >> Enable bit, so the function can't use MSI-X at all. Sure, will remove Function Mask bit only clear MSI-X enable bit is enough, actually MSI-X enable bit doesn't change the internal hardware and there will be no interrupts after resume hence below command timeout and eventually error observed more logs below: [ 418.572737] xhci_hcd 0000:03:00.0: xhci_hub_status_data: stopping usb5 port polling *[ 423.724511] xhci_hcd 0000:03:00.0: Command timeout, USBSTS: 0x00000000****[ 423.724517] xhci_hcd 0000:03:00.0: Command timeout* [ 423.724519] xhci_hcd 0000:03:00.0: Abort command ring [ 425.740742] xhci_hcd 0000:03:00.0: No stop event for abort, ring start fail? *[ 425.740771] xhci_hcd 0000:03:00.0: Error while assigning device slot ID****[ 425.740777] xhci_hcd 0000:03:00.0: Max number of devices this xHCI host supports is 64*. [ 425.740782] usb usb5-port1: couldn't allocate usb_device [ 425.740794] xhci_hcd 0000:03:00.0: disable port 5-1, portsc: 0x6e1 [ 425.740818] hub 5-0:1.0: hub_suspend [ 425.740826] usb usb5: bus auto-suspend, wakeup 1 [ 425.740835] xhci_hcd 0000:03:00.0: xhci_hub_status_data: stopping usb5 port polling [ 425.740842] xhci_hcd 0000:03:00.0: xhci_suspend: stopping usb5 port polling. [ 425.756878] xhci_hcd 0000:03:00.0: // Setting command ring address to 0xffffe001 [ 425.776898] xhci_hcd 0000:03:00.0: WARN: xHC save state timeout [ 425.776910] xhci_hcd 0000:03:00.0: PM: suspend_common(): xhci_pci_suspend+0x0/0x170 [xhci_pci] returns -110 [ 425.776917] xhci_hcd 0000:03:00.0: hcd_pci_runtime_suspend: -110 [ 425.776918] xhci_hcd 0000:03:00.0: can't suspend (hcd_pci_runtime_suspend returned -110) will change function name accordingly quirk_clear_msix_en and with only ctrl &= ~PCI_MSIX_FLAGS_ENABLE; >> >> All state is lost in D3cold, so I guess this problem must occur during >> a D3hot to D0 transition, right? I assume this device sets >> No_Soft_Reset, so the function is supposed to return to D0active with >> all internal state intact. But this device returns to D0active with >> the MSI-X internal state corrupted? >> >> I assume this relies on pci_restore_state() to restore the MSI-X >> state. Seems like that might be enough to restore the internal state >> even without this quirk, but I guess it must not be. > The important part is the register value changing to make > the internal hardware move. Because it restores identically it doesn't change > the internal hardware. Yes correct, even though pci_restore_state restores all pci registers states including MSI-X bits __pci_restore_msix_state after resume but internal AMD controller's MSI_X enable bit is out of sync and AMD controller fails to maintain internal MSI-X enable bits. > >>> Note: This quirk works in all scenarios, regardless of whether the >>> integrated GPU is disabled in the BIOS. >> I don't know how the integrated GPU is related to this USB controller, >> but I assume this fact is important somehow? > This bug is due to a BIOS bug with the initialization. We also posted in > parallel a different workaround that fixes the initialization to match what > the BIOS should have set via the GPU driver. > > It should be going in for 6.3-rc2. > https://gitlab.freedesktop.org/agd5f/linux/-/commit/07494a25fc8881e122c242a46b5c53e0e4403139 > > But because these are desktop processors, users can decide in BIOS setup > whether the integrated GPU should be enabled or disabled and that > workaround won't work if it's disabled. > >>> Cc: Mario Limonciello <mario.limonciello@xxxxxxx> >>> Reported-by: Thomas Glanzmann <thomas@xxxxxxxxxxxx> >>> Link: https://lore.kernel.org/linux- >> usb/Y%2Fz9GdHjPyF2rNG3@xxxxxxxxxxxx/T/#u >> >> Apparently the symptom is one of these: >> >> xhci_hcd 0000:0c:00.0: Error while assigning device slot ID: Command >> Aborted >> xhci_hcd 0000:0c:00.0: Max number of devices this xHCI host supports is 64. >> usb usb1-port1: couldn't allocate usb_device >> xhci_hcd 0000:0c:00.0: WARN: xHC save state timeout >> xhci_hcd 0000:0c:00.0: PM: suspend_common(): >> xhci_pci_suspend+0x0/0x150 [xhci_pci] returns -110 >> xhci_hcd 0000:0c:00.0: can't suspend (hcd_pci_runtime_suspend [usbcore] >> returned -110) >> >> We should include the critical line or two in the commit log to help >> users find the fix. >> >> I see this must be xhci_suspend() returning -ETIMEDOUT after >> xhci_save_registers(), but I don't see the connection from there to a >> PCI_FIXUP_SUSPEND. Can you connect the dots for me? Enabled more verbose logs in above comment for actual timeout due to MSI-X interrupt disabled internally even though in register shows MSI-X enabled due to internal hardware de-synchronizing MSI-X enable bits on suspend so explicit clear of MSI-X during suspend help to maintain and restores both internal register in sync with MSI-X enable bit. [ 418.572737] xhci_hcd 0000:03:00.0: xhci_hub_status_data: stopping usb5 port polling *[ 423.724511] xhci_hcd 0000:03:00.0: Command timeout, USBSTS: 0x00000000****[ 423.724517] xhci_hcd 0000:03:00.0: Command timeout* [ 423.724519] xhci_hcd 0000:03:00.0: Abort command ring >>> Signed-off-by: Basavaraj Natikar <Basavaraj.Natikar@xxxxxxx> >>> --- >>> drivers/pci/quirks.c | 10 ++++++++++ >>> 1 file changed, 10 insertions(+) >>> >>> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c >>> index 44cab813bf95..ddf7100227d3 100644 >>> --- a/drivers/pci/quirks.c >>> +++ b/drivers/pci/quirks.c >>> @@ -6023,3 +6023,13 @@ >> DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x9a2d, >> dpc_log_size); >>> DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x9a2f, >> dpc_log_size); >>> DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x9a31, >> dpc_log_size); >>> #endif >>> + >>> +static void quirk_clear_msix(struct pci_dev *dev) >>> +{ >>> + u16 ctrl; >>> + >>> + pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, >> &ctrl); >>> + ctrl &= ~(PCI_MSIX_FLAGS_MASKALL | PCI_MSIX_FLAGS_ENABLE); >>> + pci_write_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, >> ctrl); >>> +} >>> +DECLARE_PCI_FIXUP_SUSPEND(PCI_VENDOR_ID_AMD, 0x15b8, >> quirk_clear_msix); >>> -- >>> 2.25.1 >>>