Re: Bug: Completion-Wait loop timed out with vfio

Tasos Sahanidis <tasos@xxxxxxxxxxxx> · Mon, 6 Mar 2023 11:49:37 +0200

On 2023-03-03 18:41, Alex Williamson wrote:
>>> But it suddenly dawns on me that you're assigning a Radeon HD 7790,
>>> which is one of the many AMD GPUs which is plagued by reset problems.
>>> I wonder if that's a factor there.  This particular GPU even has
>>> special handling in QEMU to try to manually reset the device, and which
>>> likely has never been tested since adding runtime power management
>>> support.  In fact, I'm surprised anyone is doing regular device
>>> assignment with an HD 7790 and considers it a normal, acceptable
>>> experience even with the QEMU workarounds.  
>>
>> I had no idea. I always assumed that because it worked out of the box
>> ever since I first tried passing it through, it wasn't affected by these
>> reset issues. I never had any trouble with it until now.
> 
> IIRC, so long as the VM is always booting and cleanly shutting down,
> then the QEMU quirk is sufficient, but if you need to kill QEMU the GPU
> might be in a bad state that requires a host reboot to recover.
> 

I tried SIGKILLing QEMU a few times and the card kept working.

>>> I certainly wouldn't feel comfortable proposing a quirk for the
>>> downstream port to disable D3hot for an issue only seen when assigning
>>> a device with such a nefarious background relative to device
>>> assignment.  It does however seem like there are sufficient options in
>>> place to work around the issue, either disabling power management at
>>> the vfio-pci driver, or specifically for the downstream port via sysfs.
>>> I don't really have any better suggestions given our limited ability to
>>> test and highly suspect target device.  Any other ideas, Abhishek?
>>> Thanks,
>>>
>>> Alex  
>>
>> This actually gave me an idea on how to check if it's the graphics card
>> that's at fault, or if it is QEMU's workarounds.
>>
>> I booted up the system as usual and let vfio-pci take over the device.
>> Both the device itself and the PCIe port were at D3hot. I manually
>> forced the PCIe port to switch to D0, with the GPU remaining at D3hot. I
>> then proceeded to start up the VM, and there were no errors in dmesg.
>>
>> If it's even possible, it sounds like QEMU might be doing something
>> before the PCIe port is (fully?) out of D3hot, and thus the card tries
>> to do something which makes the IOMMU unhappy.
>>
>> Is there something in either the rpm trace, or elsewhere that can help
>> me dig into this further?
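
For anyone wanting to repeat the test quoted above: the sketch below
assumes the port is held in D0 through its sysfs runtime-PM attribute,
power/control, where "on" keeps the device powered and "auto" allows
runtime suspend again. The helper names and the 0000:00:01.1 address
used in the comments are placeholders for illustration, not the actual
topology of this system.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build the sysfs path of the runtime-PM "control" attribute for a PCI
 * device given as domain:bus:device.function (e.g. "0000:00:01.1",
 * a placeholder address). Returns 0 on success, -1 if buf is too small.
 */
int pci_power_control_path(char *buf, size_t len, const char *bdf)
{
	int n = snprintf(buf, len, "/sys/bus/pci/devices/%s/power/control", bdf);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Write "on" (keep the device out of runtime suspend) or "auto" (let
 * runtime PM manage it) to the attribute. Needs root on a real system.
 * Returns 0 on success, -1 on an invalid mode or I/O error.
 */
int pci_set_power_control(const char *path, const char *mode)
{
	FILE *f;
	int rc = 0;

	if (strcmp(mode, "on") != 0 && strcmp(mode, "auto") != 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs(mode, f) == EOF)
		rc = -1;
	if (fclose(f) == EOF)
		rc = -1;
	return rc;
}
```

Building the path for the downstream port and writing "on" before
starting the VM should correspond to the manual step described above;
writing "auto" afterwards restores the default behaviour.
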
> 
> That's interesting to find.  There are quirks in the kernel that don't
> disable D3hot, but just extend the suspend/resume time.  If you're
> slightly comfortable with coding and building the kernel, you could try
> something like below.  With the level of information we have, I'd feel
> more comfortable only proposing to extend the resume time for the 7790
> and not the downstream port, but I've put both in below to play with.
> 
> You can comment out one of the DECLARE... lines to disable each.  The 20
> value here is in ms and I have no idea what it should be.  There are a
> couple quirks that use this 20ms value and a bunch of Intel device IDs
> set an equivalent value to 120ms.  Experiment and see if you can find
> something that works reliably.  Thanks,
> 
> Alex
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 44cab813bf95..d9ae376d9524 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -1956,6 +1956,15 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x15e0, quirk_ryzen_xhci_d3hot);
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x15e1, quirk_ryzen_xhci_d3hot);
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x1639, quirk_ryzen_xhci_d3hot);
>  
> +static void quirk_d3hot_test_delay(struct pci_dev *dev)
> +{
> +	quirk_d3hot_delay(dev, 20);
> +}
> +/* Radeon HD 7790 */
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x665c, quirk_d3hot_test_delay);
> +/* Matisse PCIe GPP Bridge Downstream Ports */
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x57a3, quirk_d3hot_test_delay);
> +
>  #ifdef CONFIG_X86_IO_APIC
>  static int dmi_disable_ioapicreroute(const struct dmi_system_id *d)
>  {
> 

The quirk on the downstream port changed nothing, which is both good and
bad I guess. The quirk on the 7790, when set to 120ms, did stop the
error messages, but only when the VM was stopping. When the VM was
starting, the messages remained the same, which is puzzling: the delay
applies when going from D3 to D0, which happens when the VM starts, not
when it stops. Raising it as high as 500ms changed nothing further.
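
For reference when trying different values, here is a minimal userspace
model of how I understand quirk_d3hot_delay() to behave: it only ever
raises a device's D3hot delay and never lowers it. The struct, the
function name and the 10ms starting value are assumptions made for
illustration (the 10ms is from memory of the kernel default), and this
is not the kernel code itself.

```c
#include <stdio.h>

/* Assumed default D3hot wait in ms; treat the exact value as an assumption. */
#define MODEL_D3HOT_WAIT_MS 10

/* Hypothetical stand-in for the one field of struct pci_dev that matters here. */
struct model_dev {
	unsigned int d3hot_delay; /* ms */
};

/*
 * Model of the quirk helper: extend the delay only if the requested
 * value is larger than the one already set. Returns the resulting delay.
 */
unsigned int model_quirk_d3hot_delay(struct model_dev *dev, unsigned int delay)
{
	if (dev->d3hot_delay < delay) {
		dev->d3hot_delay = delay;
		printf("extending D3hot delay to %u msec\n", dev->d3hot_delay);
	}
	return dev->d3hot_delay;
}
```

Under this model, starting from the default and applying 20, 120 or 500
each takes effect, while a second quirk asking for a smaller value on
the same device would be ignored.
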

I looked at QEMU's source. Next I'll try temporarily disabling the reset
to see if the errors go away, and also adding delays in a few different
places (there are some in there already).

--
Tasos