Thanks for the response, here's a link to the entire dmesg log:
https://drive.google.com/file/d/1Uau0cgd2ymYGDXNr1mA9X_UdLoMH_Azn/view

Some entries that might be of interest:

[    0.378712] pci_bus 0000:00: root bus resource [io  0x0000-0xffff]
[    0.378712] pci_bus 0000:00: root bus resource [mem 0x00000000-0xffffffffff]
[    0.378712] pci_bus 0000:00: root bus resource [bus 00-ff]
...

For GPU 1 on bus 01:00.0 the process goes like this:

[    0.676903] pci 0000:01:00.0: [10de:20b2] type 00 class 0x030200
[    0.677433] pci 0000:01:00.0: reg 0x10: [mem 0xff000000-0xffffffff]
[    0.677551] pci 0000:01:00.0: reg 0x14: [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]
[    0.677668] pci 0000:01:00.0: reg 0x1c: [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]
...
[    1.416980] pci 0000:01:00.0: can't claim BAR 0 [mem 0xff000000-0xffffffff]: no compatible bridge window
[    1.416983] pci 0000:01:00.0: can't claim BAR 1 [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
[    1.416986] pci 0000:01:00.0: can't claim BAR 3 [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
...
[    1.445156] pci 0000:01:00.0: BAR 1: assigned [mem 0x2000000000-0x3fffffffff 64bit pref]
[    1.445380] pci 0000:01:00.0: BAR 3: assigned [mem 0x1000000000-0x1001ffffff 64bit pref]
[    1.445589] pci 0000:01:00.0: BAR 0: assigned [mem 0xdb000000-0xdbffffff]

GPU 5 on bus 05:00.0 seems to have taken the last available space in the window for BAR 1 and BAR 3:

[    1.461179] pci 0000:05:00.0: BAR 1: assigned [mem 0xe000000000-0xffffffffff 64bit pref]
[    1.461361] pci 0000:05:00.0: BAR 3: assigned [mem 0xd000000000-0xd001ffffff 64bit pref]
[    1.461533] pci 0000:05:00.0: BAR 0: assigned [mem 0xdf000000-0xdfffffff]

The last step fails for the GPU on bus 06:00.0:

[    1.463503] pci 0000:06:00.0: BAR 1: no space for [mem size 0x2000000000 64bit pref]
[    1.463508] pci 0000:06:00.0: BAR 1: trying firmware assignment [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]
[    1.463511] pci 0000:06:00.0: BAR 1: [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref] conflicts with PCI mem [mem 0x00000000-0xffffffffff]
[    1.463514] pci 0000:06:00.0: BAR 1: failed to assign [mem size 0x2000000000 64bit pref]
[    1.463517] pci 0000:06:00.0: BAR 3: no space for [mem size 0x02000000 64bit pref]
[    1.463519] pci 0000:06:00.0: BAR 3: trying firmware assignment [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]
[    1.463522] pci 0000:06:00.0: BAR 3: [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref] conflicts with PCI mem [mem 0x00000000-0xffffffffff]
[    1.463525] pci 0000:06:00.0: BAR 3: failed to assign [mem size 0x02000000 64bit pref]
[    1.463527] pci 0000:06:00.0: BAR 0: assigned [mem 0xe0000000-0xe0ffffff]

If I understand correctly, the host bridge window [mem 0x00000000-0xffffffffff] is 40 bits, i.e. 1024 GB? Each BAR 3 takes only a small 32 MB slice, but each 128 GB BAR 1 is naturally aligned, so it skips ahead to the next 128 GB boundary. BAR 3 of GPU 1 already starts at 0x1000000000, so by the time assignment reaches the GPU on 06:00.0 the whole 1024 GB seems to be used up.

Increasing the window size would therefore seem to solve the issue at hand, but I haven't got a clue where to start.

Thanks for your input so far, greatly appreciated!

On Thu, 15 Jul 2021 at 17:49, Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Thu, Jul 15, 2021 at 01:43:17AM +0300, Ruben wrote:
> > No luck so far with "-global q35-pcihost.pci-hole64-size=2048G"
> > ("-global q35-host.pci-hole64-size=" gave an error "warning: global
> > q35-host.pci-hole64-size has invalid class name").
> > The result stays the same.
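
(Inline note on the above: since "the result stays the same" is doing a lot of work here, probably the quickest sanity check for whether the option is being honoured at all is to look, inside the guest, at the host bridge window and at the failing GPU's BARs, e.g.

    dmesg | grep "root bus resource"
    lspci -vvs 06:00.0 | grep -i region

The "root bus resource [mem 0x00000000-0xffffffffff]" line near the top of this mail is what the first command shows for this guest, i.e. still a 40-bit / 1024 GB window.)
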
>
> Alex will have to chime in about the qemu option problem.
>
> Your dmesg excerpts don't include the host bridge window info, e.g.,
> "root bus resource [mem 0x7f800000-0xefffffff window]". That tells
> you what PCI thinks is available for devices. This info comes from
> ACPI, and I don't know whether the BIOS on qemu is smart enough to
> compute it based on "q35-host.pci-hole64-size=". But dmesg will tell
> you.
>
> "pci=nocrs" tells the kernel to ignore those windows from ACPI and
> pretend everything that's not RAM is available for devices. Of
> course, that's not true in general, so it's not really safe.
>
> PCI resources are hierarchical: an endpoint BAR must be contained
> in the Root Port's window, which must in turn be contained in the host
> bridge window. You trimmed most of that information out of your
> dmesg log, so we can't see exactly what's wrong.
>
> > When we pass through the NVLink bridges we can have the (5 working)
> > GPUs talk at full P2P bandwidth; this is described in the NVidia docs as
> > a valid option (i.e. passing through all GPUs and NVLink bridges).
> > In production we have the bridges passed through to a service VM which
> > controls traffic, which is also described in their docs.
> >
> > On Thu, 15 Jul 2021 at 01:03, Alex Williamson
> > <alex.williamson@xxxxxxxxxx> wrote:
> > >
> > > On Thu, 15 Jul 2021 00:32:30 +0300
> > > Ruben <rubenbryon@xxxxxxxxx> wrote:
> > >
> > > > I am experiencing an issue with virtualizing a machine which contains
> > > > 8 NVidia A100 80GB cards.
> > > > As a bare metal host, the machine behaves as expected: the GPUs are
> > > > connected to the host with a PLX chip PEX88096, which connects 2 GPUs
> > > > to 16 lanes on the CPU (using the same NVidia HGX Delta baseboard).
> > > > When passing through all GPUs and NVLink bridges to a VM, a problem
> > > > arises in that the system can only initialize 4-5 of the 8 GPUs.
> > > >
> > > > The dmesg log shows failed attempts at assigning BAR space to the GPUs
> > > > that are not getting initialized.
> > > >
> > > > Things that were tried:
> > > > Q35 and i440fx, with and without UEFI
> > > > Qemu 5.x, Qemu 6.0
> > > > Ubuntu 20.04 host with Qemu/libvirt
> > > > Now running proxmox 7 on debian 11, host kernel 5.11.22-2, VM kernel 5.4.0-77
> > > > VM kernel parameters pci=nocrs pci=realloc=on/off
> > > >
> > > > ------------------------------------
> > > >
> > > > lspci -v:
> > > > 01:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
> > > > Memory at db000000 (32-bit, non-prefetchable) [size=16M]
> > > > Memory at 2000000000 (64-bit, prefetchable) [size=128G]
> > > > Memory at 1000000000 (64-bit, prefetchable) [size=32M]
> > > >
> > > > 02:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
> > > > Memory at dc000000 (32-bit, non-prefetchable) [size=16M]
> > > > Memory at 4000000000 (64-bit, prefetchable) [size=128G]
> > > > Memory at 6000000000 (64-bit, prefetchable) [size=32M]
> > > >
> > > > ...
> > > >
> > > > 0c:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
> > > > Memory at e0000000 (32-bit, non-prefetchable) [size=16M]
> > > > Memory at <ignored> (64-bit, prefetchable)
> > > > Memory at <ignored> (64-bit, prefetchable)
> > > >
> > > > ...
> > > ...
> > > >
> > > > ------------------------------------
> > > > I have (blindly) messed with parameters like pref64-reserve for the
> > > > pcie-root-port, but to be frank I have little clue what I'm doing, so my
> > > > question would be suggestions on what I can try.
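
(Inline note on the pref64-reserve mention just above, mostly so there is a concrete example in the archive: as far as I understand it, pref64-reserve is a per-root-port hint, so each root port sitting in front of a GPU would need a reservation of at least the 128G BAR 1, along the lines of

    -device pcie-root-port,id=rp1,bus=pcie.0,chassis=1,slot=1,pref64-reserve=130G
    -device vfio-pci,host=01:00.0,bus=rp1

where the id, chassis/slot numbers and the 130G figure are only illustrative. If Bjorn's point above about the hierarchy holds (endpoint BAR inside root port window inside host bridge window), such a reservation still cannot help while the host bridge window itself stays at 1024 GB.)
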
> > > > This server will not be running an 8 GPU VM in production, but I have a
> > > > few days left to test before it goes to work. I was hoping to learn
> > > > how to overcome this issue in the future.
> > > > Please be aware that my knowledge regarding virtualization and the
> > > > Linux kernel does not reach far.
> > > Try playing with the QEMU "-global q35-host.pci-hole64-size=" option for
> > > the VM rather than pci=nocrs. The default 64-bit MMIO hole for
> > > QEMU/q35 is only 32GB. You might be looking at a value like 2048G to
> > > support this setup, but could maybe get away with 1024G if there's room
> > > in 32-bit space for the 3rd BAR.
> > >
> > > Note that assigning bridges usually doesn't make a lot of sense, and
> > > NVLink is a proprietary black box, so we don't know how to virtualize
> > > it or what the guest drivers will do with it; you're on your own there.
> > > We generally recommend using vGPUs for such cases so the host driver
> > > can handle all the NVLink aspects of GPU peer-to-peer. Thanks,
> > >
> > > Alex
> > >
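
(Closing note for completeness: a minimal sketch of where the hole-size option would sit on a bare QEMU command line, using the "q35-pcihost" spelling that QEMU accepted earlier in this thread; machine options and device addresses are illustrative, and this exact invocation is untested:

    qemu-system-x86_64 -machine q35,accel=kvm -m 64G \
        -global q35-pcihost.pci-hole64-size=2048G \
        -device pcie-root-port,id=rp1,bus=pcie.0,chassis=1,slot=1 \
        -device vfio-pci,host=01:00.0,bus=rp1 \
        ...

Whether the guest firmware actually picks that value up and widens the host bridge window is the open question in this thread, and it is exactly what the "root bus resource" lines in the guest dmesg should reveal.)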