Re: Bug: Completion-Wait loop timed out with vfio

Abhishek Sahu <abhsahu@xxxxxxxxxx> · Mon, 27 Feb 2023 11:03:46 +0530

On 2/25/2023 11:55 AM, Tasos Sahanidis wrote:
> Hello everyone,
> 
> Attempting to pass through my graphics card to a VM with kernel >= 5.19
> results in the following (host):
> 
> [   72.645091] AMD-Vi: Completion-Wait loop timed out
> [   72.791448] AMD-Vi: Completion-Wait loop timed out
> [   72.937768] AMD-Vi: Completion-Wait loop timed out
> [   73.084388] AMD-Vi: Completion-Wait loop timed out
> [   73.231661] AMD-Vi: Completion-Wait loop timed out
> [   73.231711] ahci 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0017 address=0xc5f3f000 flags=0x0050]
> [   73.231724] ahci 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0017 address=0xc5f3f040 flags=0x0050]
> [   73.231734] ahci 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0017 address=0xc5f3f080 flags=0x0050]
> [   73.231743] ahci 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0017 address=0xc5f3f0c0 flags=0x0050]
> [   73.231752] ahci 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0017 address=0xc5f3f100 flags=0x0050]
> [   73.231761] ahci 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0017 address=0xc5f3f140 flags=0x0050]
> [   73.231770] ahci 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0017 address=0xc5f3f180 flags=0x0050]
> [   73.231779] ahci 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0017 address=0xc5f3f1c0 flags=0x0050]
> [   73.231788] ahci 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0017 address=0xc5f3f200 flags=0x0050]
> [   73.231797] ahci 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0017 address=0xc5f3f240 flags=0x0050]
> [   73.377900] AMD-Vi: Completion-Wait loop timed out
> [   73.500538] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=06:00.0 address=0x1001e4600]
> [   73.546431] AMD-Vi: Completion-Wait loop timed out
> [   73.693772] AMD-Vi: Completion-Wait loop timed out
> [   73.847385] AMD-Vi: Completion-Wait loop timed out
> [   74.001796] AMD-Vi: Completion-Wait loop timed out
> [   74.148077] AMD-Vi: Completion-Wait loop timed out
> [   74.168380] virbr0: port 2(vnet0) entered learning state
> [   74.294937] AMD-Vi: Completion-Wait loop timed out
> [   74.296484] ata2.00: exception Emask 0x20 SAct 0x7e703fff SErr 0x0 action 0x6 frozen
> [   74.296492] ata2.00: irq_stat 0x20000000, host bus error
> [   74.296496] ata2.00: failed command: WRITE FPDMA QUEUED
> [   74.296498] ata2.00: cmd 61/08:00:c0:ec:91/00:00:01:00:00/40 tag 0 ncq dma 4096 out
>                         res 40/00:34:20:eb:91/00:00:01:00:00/40 Emask 0x20 (host bus error)
> [   74.296507] ata2.00: status: { DRDY }
> [more ATA errors]
> [   74.296724] ata2: hard resetting link
> [   74.430739] AMD-Vi: Completion-Wait loop timed out
> [   74.502557] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=06:00.0 address=0x1001e4660]
> [   74.502563] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=06:00.0 address=0x1001e4680]
> [   74.680713] vfio-pci 0000:06:00.0: enabling device (0000 -> 0003)
> [   74.681219] vfio-pci 0000:06:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
> [   74.681235] vfio-pci 0000:06:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
> [   74.700687] vfio-pci 0000:06:00.1: enabling device (0000 -> 0002)
> [   74.772816] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [   74.775906] ata2.00: configured for UDMA/133
> [   74.775957] ata2: EH complete
> [   74.935315] AMD-Vi: Completion-Wait loop timed out
> [   75.073590] AMD-Vi: Completion-Wait loop timed out
> [   75.212946] AMD-Vi: Completion-Wait loop timed out
> [   75.379316] AMD-Vi: Completion-Wait loop timed out
> [   75.504512] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=06:00.0 address=0x1001e46f0]
> 
> Stopping the VM results in similar messages.
> 
> The card is an AMD Radeon HD 7790 (1002:665c) and shows up at 06:00.0 on
> the host. This is a Ryzen system with an ASUS "TUF GAMING X570-PLUS".
> Userspace virt-related packages are all stock from Ubuntu 20.04.
> 
> While these messages are printed, sometimes the cursor and audio
> stutter. These temporary freezes have also caused file system
> corruption. The graphics card is non functional in this state.
> 
> Bisecting this shows that the issue was introduced by:
> 7ab5e10eda02d ("vfio/pci: Move the unused device into low power state with runtime PM").
> 
> Reverting that commit in 5.19 results in GPU passthrough working as
> expected. The patch doesn't cleanly revert on kernels newer than 5.19.
> 
> --
> Tasos

 Thanks Tasos.

 The patch enables runtime power management for the device. Previously, an unused
 device was put into the D3hot state. Now it is put into D3cold, where power to
 the device is removed completely.

 If the issue appears only after this patch, it suggests that runtime power
 management is not working as expected with this device or platform.
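
 As a rough illustration of the D3hot/D3cold distinction above, the sketch below
 reads the per-device d3cold_allowed attribute described in the kernel's PCI sysfs
 ABI. The function name d3cold_policy and the overridable SYSFS_ROOT variable are
 hypothetical helpers introduced only for this example, and whether keeping the
 device out of D3cold avoids the timeouts on this platform is an untested assumption.

```shell
#!/bin/sh
# Hypothetical helper: report whether D3cold is permitted for one PCI device.
# SYSFS_ROOT defaults to /sys; it is overridable only so that the sketch can
# be exercised against a fake directory tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

d3cold_policy() {
    bdf="$1"    # full PCI address, e.g. 0000:06:00.0
    attr="$SYSFS_ROOT/bus/pci/devices/$bdf/d3cold_allowed"
    if [ ! -r "$attr" ]; then
        # Device or attribute not present (or not readable).
        echo "unknown"
        return 1
    fi
    case "$(cat "$attr")" in
        1) echo "allowed" ;;
        0) echo "blocked" ;;
        *) echo "unknown" ;;
    esac
}

# Example: d3cold_policy 0000:06:00.0
```

 The helper only reads the attribute. It does not change any power policy.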

 Could you try the following at your end to get more information?

 1. Set the disable_idle_d3 module parameter and check whether the issue still happens.
    This can be done by adding the following entry to the kernel command line:

               vfio_pci.disable_idle_d3=1

 2. Without starting the VM, check the status of the following sysfs entries:

    # cat /sys/bus/pci/devices/<B:D:F>/power/runtime_status
    # cat /sys/bus/pci/devices/<B:D:F>/power_state

 3. After the issue happens, run the above commands again.
 4. Run lspci -s <B:D:F> -vvv without starting the VM and check that it prints the correct
    results and that no new messages appear in dmesg.
 5. Enable the ftrace events related to runtime power management before starting the VM:

    # echo 1 > /sys/kernel/debug/tracing/events/rpm/enable

    and collect the trace log after the issue happens:

    # cat /sys/kernel/debug/tracing/trace

 6. Do you have an NVIDIA graphics card available? If so, could you please check
    whether the issue happens with it as well?
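
 To make steps 1 to 3 easier to repeat before and after the failure, the values
 could be gathered with a small script along these lines. This is only a sketch:
 the function names show_attr and vfio_pm_snapshot and the SYSFS_ROOT override are
 hypothetical. It assumes the module parameter is exposed under
 /sys/module/vfio_pci/parameters/, and it reads power_state directly under the
 device directory, which is where the PCI sysfs ABI places that attribute.

```shell
#!/bin/sh
# Hypothetical helper that gathers the values asked for in steps 1 to 3.
# SYSFS_ROOT defaults to /sys; it is overridable only for dry runs against
# a fake directory tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print "name=value", or "name=<unavailable>" when the file cannot be read.
show_attr() {
    name="$1"
    path="$2"
    if [ -r "$path" ]; then
        echo "$name=$(cat "$path")"
    else
        echo "$name=<unavailable>"
    fi
}

vfio_pm_snapshot() {
    bdf="$1"    # full PCI address, e.g. 0000:06:00.0
    dev="$SYSFS_ROOT/bus/pci/devices/$bdf"
    # Step 1: is idle D3 disabled for vfio-pci?
    show_attr disable_idle_d3 \
        "$SYSFS_ROOT/module/vfio_pci/parameters/disable_idle_d3"
    # Steps 2 and 3: runtime PM status and current PCI power state.
    show_attr runtime_status "$dev/power/runtime_status"
    show_attr power_state "$dev/power_state"
}

# Example: vfio_pm_snapshot 0000:06:00.0
```

 Running it once before starting the VM and once after the messages appear gives
 two snapshots that can be compared side by side.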

  Thanks,
  Abhishek