Dear virtualization mailing list,

My question may well be misplaced, because it is Thunderbolt-, eGPU- and NVidia-related all at once, but I am out of ideas where else to ask. (Should I ask on a qemu- or libvirt-specific list instead? If so, please give me a hint.)

First, here's the configuration of the physical (host) machine:

Command line: pcie_ports=native pci=assign-busses,hpbussize=0x33,realloc,hpmmiosize=256M,hpmmioprefsize=16G mem_encrypt=on
lspci -tv: https://pastebin.com/raw/usBudC1y
Motherboard: ASRock X570 Creator with BIOS 3.50
CPU: AMD Ryzen 9 3950X
System: Arch Linux with kernel 5.12.15 / 5.13.1
eGPU enclosure: Razer Core X Chroma
eGPU: NVidia Quadro P5000
UEFI settings: Above 64b decoding, IOMMU and SR-IOV all *enabled*

Other PCIe slots:
* GPU: AMD Radeon Pro W5700
* M2: Two Seagate FireCuda 520 (ZP2000GM30002)
* WiFi: Intel AX200 (factory-default)

The eGPU is configured like this in libvirt:

<hostdev mode="subsystem" type="pci" managed="yes">
  <source>
    <address domain="0x0000" bus="0x3d" slot="0x00" function="0x0"/>
  </source>
  <address type="pci" domain="0x0000" bus="0x07" slot="0x00" function="0x0"/>
</hostdev>

Now the problem: Forwarding the NVidia card inside the eGPU into virtual machines was flaky up to 5.12.x (i.e., it sometimes worked, sometimes didn't) and stopped working entirely with 5.13:

virsh # start FreeBSD
error: Failed to start domain 'FreeBSD'
error: internal error: qemu unexpectedly closed the monitor: 2021-07-11T10:34:09.102381Z qemu-system-x86_64: -device vfio-pci,host=0000:3d:00.0,id=hostdev0,bus=pci.6,addr=0x0: vfio 0000:3d:00.0: error getting device from group 49: Invalid argument
Verify all devices in group 49 are bound to vfio-<bus> or pci-stub and not already in use

virsh # start Windows
error: Failed to start domain 'Windows'
error: internal error: qemu unexpectedly closed the monitor: qxl_send_events: spice-server bug: guest stopped, ignoring
2021-07-11T10:34:36.163549Z qemu-system-x86_64: -device
vfio-pci,host=0000:3d:00.0,id=hostdev0,bus=pci.7,addr=0x0: vfio_listener_region_add received unaligned region
2021-07-11T10:34:39.432499Z qemu-system-x86_64: -device vfio-pci,host=0000:3d:00.0,id=hostdev0,bus=pci.7,addr=0x0: vfio_listener_region_del received unaligned region
2021-07-11T10:34:39.567039Z qemu-system-x86_64: -device vfio-pci,host=0000:3d:00.0,id=hostdev0,bus=pci.7,addr=0x0: vfio 0000:3d:00.0: error getting device from group 49: Invalid argument
Verify all devices in group 49 are bound to vfio-<bus> or pci-stub and not already in use

============
With 5.12.x:

There were "lucky" and "unlucky" boots/uptimes. During a "lucky" uptime, VMs could be started and restarted arbitrarily, and the NVidia eGPU worked flawlessly. During an "unlucky" uptime, the errors above popped up every single time, and no VM using the eGPU could be started; restarting the eGPU did not help. The likelihood of a "lucky" uptime was roughly 1 in 3, so it took quite a few reboots to get there. :-( /o\
============

============
With 5.13.x:

After boot, the eGPU on Thunderbolt initially doesn't work at all: it doesn't show up in lspci, the nvidia module is not loaded, etc. Switching the eGPU off and on doesn't help. Surprisingly, the only way to make it initialize (that I've discovered thus far) is:

modprobe -r thunderbolt
modprobe thunderbolt

After that, the eGPU and the NVidia GPU are detected, the modules are loaded, nvidia-smi works and shows information, etc. -- but attempts to start a VM _always_ produce the errors above. I have not seen a "lucky" uptime in more than 50 boots. :-( Also, before unloading and reloading thunderbolt, there is simply no device 3d:00.0 anywhere on PCI (and no trace of NVidia elsewhere), so that machine state is a (VM) non-starter.
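For reference, this is roughly the sanity check I run against the "group 49" error above: it lists every device in that IOMMU group together with the driver it is currently bound to. (The group number is taken from the qemu error message, and the sysfs layout is the standard one; adjust for other machines.)

```shell
# List each device in the IOMMU group from the qemu error,
# plus the driver it is bound to right now.
group=49
dir="/sys/kernel/iommu_groups/$group/devices"
report=""
if [ -d "$dir" ]; then
    for dev in "$dir"/*; do
        addr=$(basename "$dev")
        if [ -e "$dev/driver" ]; then
            # "driver" is a symlink to the bound driver's sysfs node
            drv=$(basename "$(readlink -f "$dev/driver")")
        else
            drv="(no driver bound)"
        fi
        report="$report$addr -> $drv
"
    done
else
    report="IOMMU group $group not present in sysfs"
fi
printf '%s\n' "$report"
```

Per the error text, qemu expects every device in the group to be bound to vfio-pci (or pci-stub) and not otherwise in use before the VM can start, so any other driver name in this listing would be suspicious.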
What else I tried:

* options thunderbolt start_icm=1 -- no change (plus, admittedly, I have no clue what the internal connection manager means/does)
* options vfio_iommu_type1 disable_hugepages=1 -- "What if the 'unaligned region' is related to huge pages?" No change here either. /o\
* a whole lot of reboots, Thunderbolt disconnects/reconnects, etc. Nope, it won't work.
============

Final note: Without the extra command line tokens, namely

pcie_ports=native pci=assign-busses,hpbussize=0x33,realloc,hpmmiosize=256M,hpmmioprefsize=16G

the NVidia eGPU just won't work, neither on 5.12.x nor on 5.13.x. Way more details about that are here:

https://egpu.io/forums/postid/90608/
https://bbs.archlinux.org/viewtopic.php?id=261303

What should I try next to debug the issue and, importantly, to keep my VMs working on 5.13.x and beyond? Any ideas, tips, magic kernel command line tokens, etc. would be very helpful.

Cheers!
Andrej
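P.S. In case it matters, the two module options above are set via modprobe.d fragments along these lines (the filenames are just my convention):

```
# /etc/modprobe.d/thunderbolt.conf
options thunderbolt start_icm=1

# /etc/modprobe.d/vfio_iommu_type1.conf
options vfio_iommu_type1 disable_hugepages=1
```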
_______________________________________________
Virtualization mailing list
Virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/virtualization