Re: Bug: Completion-Wait loop timed out with vfio

Tasos Sahanidis <tasos@xxxxxxxxxxxx> · Fri, 3 Mar 2023 08:33:14 +0200

On 2023-03-02 22:36, Alex Williamson wrote:
> Yes, the fact that the NIC works suggests there's not simply a blatant
> chip defect where we should blindly disable D3 power state support for
> this downstream port.  I'm also not seeing any difference in the
> downstream port configuration between the VM running after the port has
> resumed from D3hot and the case where the port never entered D3hot.

Agreed.

> But it suddenly dawns on me that you're assigning a Radeon HD 7790,
> which is one of the many AMD GPUs which is plagued by reset problems.
> I wonder if that's a factor there.  This particular GPU even has
> special handling in QEMU to try to manually reset the device, and which
> likely has never been tested since adding runtime power management
> support.  In fact, I'm surprised anyone is doing regular device
> assignment with an HD 7790 and considers it a normal, acceptable
> experience even with the QEMU workarounds.

I had no idea. I always assumed that because it worked out of the box
ever since I first tried passing it through, it wasn't affected by these
reset issues. I never had any trouble with it until now.

> I certainly wouldn't feel comfortable proposing a quirk for the
> downstream port to disable D3hot for an issue only seen when assigning
> a device with such a nefarious background relative to device
> assignment.  It does however seem like there are sufficient options in
> place to work around the issue, either disabling power management at
> the vfio-pci driver, or specifically for the downstream port via sysfs.
> I don't really have any better suggestions given our limited ability to
> test and highly suspect target device.  Any other ideas, Abhishek?
> Thanks,
> 
> Alex

This actually gave me an idea on how to check if it's the graphics card
that's at fault, or if it is QEMU's workarounds.

I booted up the system as usual and let vfio-pci take over the device.
Both the device itself and the PCIe port were in D3hot. I manually
forced the PCIe port into D0, leaving the GPU in D3hot. I then started
the VM, and there were no errors in dmesg.
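
For anyone wanting to repeat the test, here is a minimal sketch of one way
to pin a port in D0. It assumes the runtime PM "power/control" sysfs
attribute is used (the steps above don't depend on this exact method), and
the helper names and the address in the usage note are made up for
illustration. The "power_state" attribute is only present on newer kernels,
so the read helper returns None when it is missing.

```python
from pathlib import Path

# Standard sysfs location for PCI devices; overridable so the helpers
# can be exercised against a scratch directory.
SYSFS_PCI = Path("/sys/bus/pci/devices")


def set_runtime_pm(bdf, mode, root=SYSFS_PCI):
    """Write 'on' or 'auto' to <bdf>/power/control and return the new value.

    'on' forbids runtime suspend, so the device is resumed and kept in D0.
    'auto' lets the kernel runtime-suspend it again.
    """
    if mode not in ("on", "auto"):
        raise ValueError("mode must be 'on' or 'auto'")
    path = Path(root) / bdf / "power" / "control"
    path.write_text(mode + "\n")
    return path.read_text().strip()


def read_power_state(bdf, root=SYSFS_PCI):
    """Return the reported power state (e.g. 'D0', 'D3hot'), or None
    if this kernel does not expose the power_state attribute."""
    path = Path(root) / bdf / "power_state"
    if not path.exists():
        return None
    return path.read_text().strip()
```

Run as root, with the port's real address substituted for the placeholder:
set_runtime_pm("0000:00:01.0", "on"), then read_power_state("0000:00:01.0")
to confirm the port left D3hot before starting the VM.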

If that's even possible, it sounds like QEMU might be doing something
before the PCIe port is (fully?) out of D3hot, and the card then tries
to do something that makes the IOMMU unhappy.

Is there something in the rpm trace, or elsewhere, that can help me dig
into this further?

--
Tasos