drivers/pci: (and/or KVM): Slow PCI initialization during VM boot with passthrough of large BAR Nvidia GPUs on DGX H100

Mitchell Augustin <mitchell.augustin@xxxxxxxxxxxxx> · Mon, 25 Nov 2024 16:46:29 -0600

Hello,

I've been working on a bug regarding slow PCI initialization and BAR
assignment times for Nvidia GPUs passed through to VMs on our DGX H100.
I originally believed this to be an issue in OVMF, but upon further
investigation, I now suspect that it may be an issue somewhere in
the kernel. (The original edk2 mailing list thread, with extra
context, is linked at [0].)

When running the 6.12 kernel on a DGX H100 host with 4 GPUs passed
through using CPU passthrough and this virt-install command[1], VMs
using the latest OVMF version will take around 2 minutes for the guest
kernel to boot and initialize PCI devices/BARs for the GPUs.
Originally, I was investigating this as an issue in OVMF, because GPU
initialization takes much less time when our host is running an OVMF
version with this patch[2] removed (which only calculates the MMIO
window size differently). Without that patch, the guest kernel does
boot quickly, but we can only use the Nvidia GPUs within the guest if
`pci=nocrs pci=realloc` are set on the guest kernel command line
(evidently because the MMIO windows advertised by OVMF to the kernel
without this patch are incorrect). So, with the OVMF patch present,
the MMIO windows are correct and those kernel options are not needed,
but the VM boot time is much slower.

In discussing this, another contributor reported slow PCIe/BAR
initialization times for large BAR Nvidia GPUs in Linux when using VMs
with SeaBIOS as well. This, combined with my not seeing any slowness
when these GPUs are initialized on the host, and the fact that this
slowness only happens when CPU passthrough is used, leads me to
suspect that this may actually be a problem somewhere in the KVM or
vfio-pci stack. I did also attempt manually setting different MMIO
window sizes using the X-PciMmio64Mb OVMF/QEMU knob, and it seems that
any MMIO window size large enough to accommodate all GPU memory
regions does result in this slower initialization time (but also a
valid mapping).
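
For anyone wanting to reproduce the window-size experiment, here is a
rough sketch of how a lower bound for the window could be estimated. It
assumes that each BAR is a power of two in size and naturally aligned,
that the knob takes a value in MiB (as its name suggests), and that
rounding the total up to a power of two is an acceptable conservative
choice. The helper names are hypothetical, and the real BAR sizes should
be taken from lspci on the host rather than from this example.

```python
MIB = 1 << 20
GIB = 1 << 30


def round_up_pow2(n: int) -> int:
    """Round a positive integer up to the next power of two."""
    if n < 1:
        raise ValueError("size must be positive")
    return 1 << (n - 1).bit_length()


def min_mmio64_mib(bar_sizes_bytes: list[int]) -> int:
    """Estimate a lower bound (in MiB) for a 64-bit MMIO window.

    Assumption: BARs are naturally aligned and packed largest first,
    so the space needed is the sum of their power-of-two sizes.
    """
    total = sum(round_up_pow2(size) for size in bar_sizes_bytes)
    mib = -(-total // MIB)  # ceiling division
    # Assumption: rounding the window itself up to a power of two.
    return round_up_pow2(mib)


if __name__ == "__main__":
    # Illustrative input only: four GPUs, 128 GiB of BAR space each.
    print(min_mmio64_mib([128 * GIB] * 4))
```

With four 128 GiB regions this gives 524288 MiB (512 GiB); smaller BARs
on the same devices would push the rounded result higher.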

I did some profiling of the guest kernel during boot, and it appears
that most of the time is spent in this pci_write_config_word() call in
__pci_read_base() of drivers/pci/probe.c.[3] Each of those
pci_write_config_word() calls takes only about 2 seconds, but that adds
up to a significant chunk of the initialization time, since
__pci_read_base() is executed somewhere between 20 and 40 times during
my VM boot.
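
For context on where those config writes come from, here is a toy model
of the general BAR sizing handshake (disable decode, write all ones,
read back the mask, restore, re-enable decode). It is a simplified
sketch and not the kernel's code: the FakeBar64 class, its fields, and
the example base address are all made up for illustration, and it
models a single 64-bit BAR as one register.

```python
PCI_COMMAND_IO = 0x1
PCI_COMMAND_MEMORY = 0x2
DECODE_ENABLE = PCI_COMMAND_IO | PCI_COMMAND_MEMORY
MASK64 = 0xFFFFFFFFFFFFFFFF


class FakeBar64:
    """Hypothetical device with one 64-bit prefetchable memory BAR."""

    def __init__(self, size: int, base: int):
        assert size & (size - 1) == 0, "BAR size must be a power of two"
        self.size = size
        self.flags = 0xC  # 64-bit type (0x4) | prefetchable (0x8)
        self.value = base
        self.command = PCI_COMMAND_MEMORY
        self.command_writes = 0  # counts writes to the command register

    def write_command(self, value: int) -> None:
        self.command = value
        self.command_writes += 1

    def write_bar(self, value: int) -> None:
        # Address bits below the BAR size are hardwired to zero.
        self.value = value & ~(self.size - 1) & MASK64

    def read_bar(self) -> int:
        return self.value | self.flags


def size_bar(dev: FakeBar64) -> int:
    """Return the BAR size using the write-ones/read-back handshake."""
    orig_cmd = dev.command
    if orig_cmd & DECODE_ENABLE:
        dev.write_command(orig_cmd & ~DECODE_ENABLE)  # decode off

    orig = dev.read_bar()
    dev.write_bar(MASK64)   # write all ones
    mask = dev.read_bar()   # hardwired low bits read back as zero
    dev.write_bar(orig)     # restore the original address

    if orig_cmd & DECODE_ENABLE:
        dev.write_command(orig_cmd)  # decode back on

    return (~(mask & ~0xF) & MASK64) + 1


if __name__ == "__main__":
    dev = FakeBar64(size=128 << 30, base=0x380000000000)
    print(hex(size_bar(dev)), dev.command_writes)
```

In this model, each sizing pass toggles memory decode off and back on,
which means two command register writes per BAR when decode was enabled
to begin with.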

As a point of comparison, I measured the time it took to hot-unplug
and re-plug these GPUs after the VM booted, and observed much more
reasonable times (under 5s for each GPU to re-initialize its memory
regions). I've also been trying to get this hotplugging working in VMs
where the GPUs aren't initially attached at boot, but in any such
configuration, the memory regions for the PCI slots that the GPUs get
attached to during hotplug are too small for the full 128GB these GPUs
require. I have yet to figure out a way to fix that; more details are
in [0].

I'm wondering if any other users of Nvidia GPUs or other large BAR
GPUs in VMs with GPU and CPU passthrough have reported similar
slowness during boot, and if anyone has any insight. If you also
suspect this might be a kernel issue, and if there is anything I can
provide to help identify the root causes in that case, please let me
know.

[0] https://edk2.groups.io/g/devel/topic/109651206

[1]
virt-install --name 4gpu-vm-2 --vcpus vcpus=16,maxvcpus=16 \
  --memory 943616 --numatune 0,mode=strict \
  --iothreads 1,iothreadids.iothread0.id=1 \
  --cputune emulatorpin.cpuset=55,167,iothreadpin0.iothread=1,iothreadpin0.cpuset=54,166,vcpupin0.vcpu=0,vcpupin0.cpuset=16,vcpupin1.vcpu=1,vcpupin1.cpuset=128,vcpupin2.vcpu=2,vcpupin2.cpuset=17,vcpupin3.vcpu=3,vcpupin3.cpuset=129,vcpupin4.vcpu=4,vcpupin4.cpuset=18,vcpupin5.vcpu=5,vcpupin5.cpuset=130,vcpupin6.vcpu=6,vcpupin6.cpuset=19,vcpupin7.vcpu=7,vcpupin7.cpuset=131,vcpupin8.vcpu=8,vcpupin8.cpuset=20,vcpupin9.vcpu=9,vcpupin9.cpuset=132,vcpupin10.vcpu=10,vcpupin10.cpuset=21,vcpupin11.vcpu=11,vcpupin11.cpuset=133,vcpupin12.vcpu=12,vcpupin12.cpuset=22,vcpupin13.vcpu=13,vcpupin13.cpuset=134,vcpupin14.vcpu=14,vcpupin14.cpuset=23,vcpupin15.vcpu=15,vcpupin15.cpuset=135 \
  --os-variant ubuntu22.04 --graphics none --noautoconsole \
  --boot loader=/usr/share/OVMF/OVMF_CODE_4M.fd,loader_ro=yes,loader_type=pflash \
  --console pty,target_type=serial \
  --network network:default --network network:private-net \
  --import \
  --disk path=/var/lib/libvirt/images/4gpu-vm-2.qcow2,format=qcow2,driver.queues=16,driver.iothread=1 \
  --host-device 1b:00.0,address.type=pci \
  --host-device 61:00.0,address.type=pci \
  --host-device c3:00.0,address.type=pci \
  --host-device df:00.0,address.type=pci

[2] https://github.com/tianocore/edk2/commit/ecb778d0ac62560aa172786ba19521f27bc3f650

[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/probe.c?h=v6.12#n251

Thanks,
-- 
Mitchell Augustin
Software Engineer - Ubuntu Partner Engineering