Re: Bug: Completion-Wait loop timed out with vfio

On Fri, 3 Mar 2023 08:33:14 +0200
Tasos Sahanidis <tasos@xxxxxxxxxxxx> wrote:

> On 2023-03-02 22:36, Alex Williamson wrote:
> > Yes, the fact that the NIC works suggests there's not simply a blatant
> > chip defect where we should blindly disable D3 power state support for
> > this downstream port.  I'm also not seeing any difference in the
> > downstream port configuration between the VM running after the port has
> > resumed from D3hot and the case where the port never entered D3hot.  
> 
> Agreed.
> 
> > But it suddenly dawns on me that you're assigning a Radeon HD 7790,
> > which is one of the many AMD GPUs which is plagued by reset problems.
> > I wonder if that's a factor there.  This particular GPU even has
> > special handling in QEMU to try to manually reset the device, and which
> > likely has never been tested since adding runtime power management
> > support.  In fact, I'm surprised anyone is doing regular device
> > assignment with an HD 7790 and considers it a normal, acceptable
> > experience even with the QEMU workarounds.  
> 
> I had no idea. I always assumed that because it worked out of the box
> ever since I first tried passing it through, it wasn't affected by these
> reset issues. I never had any trouble with it until now.

IIRC, so long as the VM always boots and shuts down cleanly, the QEMU
quirk is sufficient, but if you need to kill QEMU, the GPU might be left
in a bad state that requires a host reboot to recover.

> > I certainly wouldn't feel comfortable proposing a quirk for the
> > downstream port to disable D3hot for an issue only seen when assigning
> > a device with such a nefarious background relative to device
> > assignment.  It does however seem like there are sufficient options in
> > place to work around the issue, either disabling power management at
> > the vfio-pci driver, or specifically for the downstream port via sysfs.
> > I don't really have any better suggestions given our limited ability to
> > test and highly suspect target device.  Any other ideas, Abhishek?
> > Thanks,
> > 
> > Alex  
> 
> This actually gave me an idea on how to check if it's the graphics card
> that's at fault, or if it is QEMU's workarounds.
> 
> I booted up the system as usual and let vfio-pci take over the device.
> Both the device itself and the PCIe port were at D3hot. I manually
> forced the PCIe port to switch to D0, with the GPU remaining at D3hot. I
> then proceeded to start up the VM, and there were no errors in dmesg.
> 
> If it's even possible, it sounds like QEMU might be doing something
> before the PCIe port is (fully?) out of D3hot, and thus the card tries
> to do something which makes the IOMMU unhappy.
> 
> Is there something in either the rpm trace, or elsewhere that can help
> me dig into this further?
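
For readers who want to repeat the experiment quoted above (port forced to
D0 while the GPU stays in D3hot), runtime PM for a single device can be
pinned through its sysfs power/control attribute: writing "on" keeps the
device in D0, writing "auto" hands control back to runtime PM.  The sketch
below is a minimal userspace helper under stated assumptions, not something
taken from this thread: the function names rpm_control_path() and
set_runtime_pm() are invented for illustration, and the PCI address in the
usage note is a placeholder, since the thread does not give the port's
address.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "<root>/<bdf>/power/control" into buf.
 * Returns 0 on success, -1 if the path did not fit.
 * (Hypothetical helper; the name is not from the thread.)
 */
static int rpm_control_path(char *buf, size_t len,
			    const char *root, const char *bdf)
{
	int n;

	assert(buf != NULL && root != NULL && bdf != NULL);
	n = snprintf(buf, len, "%s/%s/power/control", root, bdf);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Write "on" (stay in D0) or "auto" (allow runtime suspend) to the
 * device's power/control attribute.  Any other mode is rejected.
 * Returns 0 on success, -1 on any error.
 */
static int set_runtime_pm(const char *root, const char *bdf, const char *mode)
{
	char path[256];
	FILE *f;
	int ok;

	if (strcmp(mode, "on") != 0 && strcmp(mode, "auto") != 0)
		return -1;
	if (rpm_control_path(path, sizeof(path), root, bdf) != 0)
		return -1;

	f = fopen(path, "w");
	if (f == NULL)
		return -1;
	ok = fputs(mode, f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

Run as root, a call such as set_runtime_pm("/sys/bus/pci/devices",
"0000:03:00.0", "on") would pin the device at that (placeholder) address
in D0; passing "auto" afterwards restores the default behaviour.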

That's an interesting find.  There are quirks in the kernel that don't
disable D3hot, but just extend the suspend/resume time.  If you're
reasonably comfortable with coding and building the kernel, you could try
something like below.  With the level of information we have, I'd feel
more comfortable only proposing to extend the resume time for the 7790
and not the downstream port, but I've put both in below to play with.

You can comment out either of the DECLARE_PCI_FIXUP_FINAL lines to
disable that quirk.  The value 20 here is in ms, and I have no idea what
it should be.  A couple of existing quirks use this 20ms value, and a
bunch of Intel device IDs set an equivalent value to 120ms.  Experiment
and see if you can find something that works reliably.  Thanks,

Alex

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 44cab813bf95..d9ae376d9524 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -1956,6 +1956,15 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x15e0, quirk_ryzen_xhci_d3hot);
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x15e1, quirk_ryzen_xhci_d3hot);
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x1639, quirk_ryzen_xhci_d3hot);
 
+static void quirk_d3hot_test_delay(struct pci_dev *dev)
+{
+	quirk_d3hot_delay(dev, 20);
+}
+/* Radeon HD 7790 */
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x665c, quirk_d3hot_test_delay);
+/* Matisse PCIe GPP Bridge Downstream Ports */
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x57a3, quirk_d3hot_test_delay);
+
 #ifdef CONFIG_X86_IO_APIC
 static int dmi_disable_ioapicreroute(const struct dmi_system_id *d)
 {
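
A note for anyone experimenting with the value passed to
quirk_d3hot_delay() in the patch above: as far as can be told from
drivers/pci/quirks.c, the helper only ever raises the device's
d3hot_delay and leaves it alone if it is already at least as large, and
the default wait after leaving D3hot is believed to be 10 ms
(PCI_PM_D3HOT_WAIT).  The following is a userspace model of that
behaviour, a sketch and not kernel code: struct model_pci_dev,
MODEL_D3HOT_WAIT_MS and model_quirk_d3hot_delay() are made-up names that
stand in for the kernel's struct pci_dev and helper.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed default wait after D3hot -> D0, in ms (kernel: PCI_PM_D3HOT_WAIT). */
#define MODEL_D3HOT_WAIT_MS 10

/* Hypothetical stand-in for the few struct pci_dev fields used here. */
struct model_pci_dev {
	unsigned short vendor;
	unsigned short device;
	unsigned int d3hot_delay;	/* ms to wait after leaving D3hot */
};

/*
 * Model of the quirk helper: extend the delay to at least `delay` ms.
 * A smaller request never shortens an existing, longer delay.
 */
static void model_quirk_d3hot_delay(struct model_pci_dev *dev,
				    unsigned int delay)
{
	assert(dev != NULL);

	if (dev->d3hot_delay >= delay)
		return;

	dev->d3hot_delay = delay;
	printf("%04x:%04x: extending delay after power-on from D3hot to %u msec\n",
	       dev->vendor, dev->device, dev->d3hot_delay);
}
```

In this model, a device starting at MODEL_D3HOT_WAIT_MS ends up at 20 after
model_quirk_d3hot_delay(&dev, 20), and a later call with 5 changes nothing.
If the real helper behaves the same way, test values below the default
would have no effect, so it only makes sense to experiment upward.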



