Re: [PATCH] PCI: Add quirk to clear MSI-X

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 3/9/2023 4:34 AM, Limonciello, Mario wrote:
> [Public]
>
>
>
>> -----Original Message-----
>> From: Bjorn Helgaas <helgaas@xxxxxxxxxx>
>> Sent: Wednesday, March 8, 2023 16:44
>> To: Natikar, Basavaraj <Basavaraj.Natikar@xxxxxxx>
>> Cc: bhelgaas@xxxxxxxxxx; linux-pci@xxxxxxxxxxxxxxx; Limonciello, Mario
>> <Mario.Limonciello@xxxxxxx>; thomas@xxxxxxxxxxxx
>> Subject: Re: [PATCH] PCI: Add quirk to clear MSI-X
>>
>> Let's mention the vendor and device name in the subject to make the
>> log more useful.

Sure will change subject as below.
Add quirk on AMD 0x15b8 device to clear MSI-X enable bit

>>
>> On Mon, Mar 06, 2023 at 12:53:40PM +0530, Basavaraj Natikar wrote:
>>> One of the AMD USB controllers fails to maintain internal functional
>>> context when transitioning from D3 to D0, desynchronizing MSI-X bits.
>>> As a result, add a quirk to this controller to clear the MSI-X bits
>>> on suspend.
>> Is this a documented erratum?  Please include a citation if so.
>>
>> Are there any other AMD USB devices with the same defect?
> FYI - it's not a hardware defect, it's a BIOS defect.
>
>> The quick clears the Function Mask bit, so the MSI-X vectors may be
>> *unmasked* depending on the state of each vectors Mask bit.  I assume
>> the potential unmasking is safe because you also clear the MSI-X
>> Enable bit, so the function can't use MSI-X at all.

Sure, will remove Function Mask bit only clear MSI-X enable bit is enough,
actually MSI-X enable bit doesn't change the internal hardware and there
will be no interrupts after resume hence below command timeout and eventually
error observed more logs below:
[  418.572737] xhci_hcd 0000:03:00.0: xhci_hub_status_data: stopping usb5 port polling
*[ 423.724511] xhci_hcd 0000:03:00.0: Command timeout, USBSTS: 0x00000000****[ 423.724517] xhci_hcd 0000:03:00.0: Command timeout*
[  423.724519] xhci_hcd 0000:03:00.0: Abort command ring
[  425.740742] xhci_hcd 0000:03:00.0: No stop event for abort, ring start fail?
*[ 425.740771] xhci_hcd 0000:03:00.0: Error while assigning device slot ID****[ 425.740777] xhci_hcd 0000:03:00.0: Max number of devices this xHCI
host supports is 64*.
[  425.740782] usb usb5-port1: couldn't allocate usb_device
[  425.740794] xhci_hcd 0000:03:00.0: disable port 5-1, portsc: 0x6e1
[  425.740818] hub 5-0:1.0: hub_suspend
[  425.740826] usb usb5: bus auto-suspend, wakeup 1
[  425.740835] xhci_hcd 0000:03:00.0: xhci_hub_status_data: stopping usb5 port polling
[  425.740842] xhci_hcd 0000:03:00.0: xhci_suspend: stopping usb5 port polling.
[  425.756878] xhci_hcd 0000:03:00.0: // Setting command ring address to 0xffffe001
[  425.776898] xhci_hcd 0000:03:00.0: WARN: xHC save state timeout
[  425.776910] xhci_hcd 0000:03:00.0: PM: suspend_common(): xhci_pci_suspend+0x0/0x170 [xhci_pci] returns -110
[  425.776917] xhci_hcd 0000:03:00.0: hcd_pci_runtime_suspend: -110
[  425.776918] xhci_hcd 0000:03:00.0: can't suspend (hcd_pci_runtime_suspend returned -110)


will change function name accordingly quirk_clear_msix_en
and with only ctrl &= ~PCI_MSIX_FLAGS_ENABLE;

>>
>> All state is lost in D3cold, so I guess this problem must occur during
>> a D3hot to D0 transition, right?  I assume this device sets
>> No_Soft_Reset, so the function is supposed to return to D0active with
>> all internal state intact.  But this device returns to D0active with
>> the MSI-X internal state corrupted?
>>
>> I assume this relies on pci_restore_state() to restore the MSI-X
>> state.  Seems like that might be enough to restore the internal state
>> even without this quirk, but I guess it must not be.
> The important part is the register value changing to make
> the internal hardware move.  Because it restores identically it doesn't change
> the internal hardware.

Yes correct, even though pci_restore_state restores all pci registers states
including MSI-X bits __pci_restore_msix_state after resume but internal AMD
controller's MSI_X enable bit is out of sync and AMD controller fails to maintain 
internal MSI-X enable bits.

>
>>> Note: This quirk works in all scenarios, regardless of whether the
>>> integrated GPU is disabled in the BIOS.
>> I don't know how the integrated GPU is related to this USB controller,
>> but I assume this fact is important somehow?
> This bug is due to a BIOS bug with the initialization.  We also posted in
> parallel a different workaround that fixes the initialization to match what
> the BIOS should have set via the GPU driver.  
>
> It should be going in for 6.3-rc2.
> https://gitlab.freedesktop.org/agd5f/linux/-/commit/07494a25fc8881e122c242a46b5c53e0e4403139
>
> But because these are desktop processors, users can decide in BIOS setup
> whether the integrated GPU should be enabled or disabled and that
> workaround won't work if it's disabled.
>
>>> Cc: Mario Limonciello <mario.limonciello@xxxxxxx>
>>> Reported-by: Thomas Glanzmann <thomas@xxxxxxxxxxxx>
>>> Link: https://lore.kernel.org/linux-
>> usb/Y%2Fz9GdHjPyF2rNG3@xxxxxxxxxxxx/T/#u
>>
>> Apparently the symptom is one of these:
>>
>>   xhci_hcd 0000:0c:00.0: Error while assigning device slot ID: Command
>> Aborted
>>   xhci_hcd 0000:0c:00.0: Max number of devices this xHCI host supports is 64.
>>   usb usb1-port1: couldn't allocate usb_device
>>   xhci_hcd 0000:0c:00.0: WARN: xHC save state timeout
>>   xhci_hcd 0000:0c:00.0: PM: suspend_common():
>> xhci_pci_suspend+0x0/0x150 [xhci_pci] returns -110
>>   xhci_hcd 0000:0c:00.0: can't suspend (hcd_pci_runtime_suspend [usbcore]
>> returned -110)
>>
>> We should include the critical line or two in the commit log to help
>> users find the fix.
>>
>> I see this must be xhci_suspend() returning -ETIMEDOUT after
>> xhci_save_registers(), but I don't see the connection from there to a
>> PCI_FIXUP_SUSPEND.  Can you connect the dots for me?

Enabled more verbose logs in above comment for actual timeout due to MSI-X 
interrupt disabled internally even though in register shows MSI-X enabled
due to internal hardware de-synchronizing MSI-X enable bits on suspend so
explicit clear of MSI-X during suspend help to maintain and restores both 
internal register in sync with MSI-X enable bit.

[  418.572737] xhci_hcd 0000:03:00.0: xhci_hub_status_data: stopping usb5 port polling
*[ 423.724511] xhci_hcd 0000:03:00.0: Command timeout, USBSTS: 0x00000000****[ 423.724517] xhci_hcd 0000:03:00.0: Command timeout*
[  423.724519] xhci_hcd 0000:03:00.0: Abort command ring

>>> Signed-off-by: Basavaraj Natikar <Basavaraj.Natikar@xxxxxxx>
>>> ---
>>>  drivers/pci/quirks.c | 10 ++++++++++
>>>  1 file changed, 10 insertions(+)
>>>
>>> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
>>> index 44cab813bf95..ddf7100227d3 100644
>>> --- a/drivers/pci/quirks.c
>>> +++ b/drivers/pci/quirks.c
>>> @@ -6023,3 +6023,13 @@
>> DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x9a2d,
>> dpc_log_size);
>>>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x9a2f,
>> dpc_log_size);
>>>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x9a31,
>> dpc_log_size);
>>>  #endif
>>> +
>>> +static void quirk_clear_msix(struct pci_dev *dev)
>>> +{
>>> +	u16 ctrl;
>>> +
>>> +	pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS,
>> &ctrl);
>>> +	ctrl &= ~(PCI_MSIX_FLAGS_MASKALL | PCI_MSIX_FLAGS_ENABLE);
>>> +	pci_write_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS,
>> ctrl);
>>> +}
>>> +DECLARE_PCI_FIXUP_SUSPEND(PCI_VENDOR_ID_AMD, 0x15b8,
>> quirk_clear_msix);
>>> --
>>> 2.25.1
>>>




[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux