Re: [question]: BAR allocation failing

Alex Williamson <alex.williamson@xxxxxxxxxx> · Wed, 14 Jul 2021 16:03:50 -0600

On Thu, 15 Jul 2021 00:32:30 +0300
Ruben <rubenbryon@xxxxxxxxx> wrote:

> I am experiencing an issue with virtualizing a machine which contains
> 8 NVidia A100 80GB cards.
> As a bare metal host, the machine behaves as expected, the GPUs are
> connected to the host with a PLX chip PEX88096, which connects 2 GPUs
> to 16 lanes on the CPU (using the same NVidia HGX Delta baseboard).
> When passing through all GPUs and NVLink bridges to a VM, a problem
> arises in that the system can only initialize 4-5 of the 8 GPUs.
> 
> The dmesg log shows failed attempts to assign BAR space to the GPUs
> that are not getting initialized.
> 
> Things that were tried:
> Q35 and i440fx, with and without UEFI
> QEMU 5.x, QEMU 6.0
> Ubuntu 20.04 host with QEMU/libvirt
> Now running Proxmox 7 on Debian 11, host kernel 5.11.22-2, VM kernel 5.4.0-77
> VM kernel parameters pci=nocrs and pci=realloc=on/off
> 
> ------------------------------------
> 
> lspci -v:
> 01:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
>         Memory at db000000 (32-bit, non-prefetchable) [size=16M]
>         Memory at 2000000000 (64-bit, prefetchable) [size=128G]
>         Memory at 1000000000 (64-bit, prefetchable) [size=32M]
> 
> 02:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
>         Memory at dc000000 (32-bit, non-prefetchable) [size=16M]
>         Memory at 4000000000 (64-bit, prefetchable) [size=128G]
>         Memory at 6000000000 (64-bit, prefetchable) [size=32M]
> 
> ...
> 
> 0c:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
>         Memory at e0000000 (32-bit, non-prefetchable) [size=16M]
>         Memory at <ignored> (64-bit, prefetchable)
>         Memory at <ignored> (64-bit, prefetchable)
> 
> ...
> 
> 
> ------------------------------------
> 
> I have (blindly) experimented with parameters like pref64-reserve for the
> pcie-root-port, but to be frank I have little idea what I'm doing, so I
> would appreciate suggestions on what to try.
> This server will not be running an 8 GPU VM in production but I have a
> few days left to test before it goes to work. I was hoping to learn
> how to overcome this issue in the future.
> Please be aware that my knowledge regarding virtualization and the
> Linux kernel does not reach far.

Try playing with the QEMU "-global q35-host.pci-hole64-size=" option for
the VM rather than pci=nocrs.  The default 64-bit MMIO hole for
QEMU/q35 is only 32GB.  You might be looking at a value like 2048G to
support this setup, but could maybe get away with 1024G if there's room
in 32-bit space for the 3rd BAR.
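
As a rough sanity check on those numbers, here is a minimal Python
sketch of the arithmetic.  It takes the BAR sizes from the lspci output
quoted above and assumes 8 identical GPUs, as in the report.  Rounding
the total up to a power of two is a simplifying assumption on my part,
not a documented QEMU requirement, and the helper names are made up for
illustration.  It ignores every other device in the VM, including the
NVLink bridges, so treat the result as a lower bound.

```python
GIB = 1 << 30
MIB = 1 << 20

NUM_GPUS = 8  # from the original report


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def hole64_size_gib(bar_sizes, num_devices):
    """Estimate a 64-bit MMIO hole size in GiB (hypothetical helper).

    Sums the given BAR sizes across all devices and rounds the total up
    to a power of two.  This is only an estimate: it does not model
    alignment padding or any other devices behind the root complex.
    """
    total = num_devices * sum(bar_sizes)
    return next_power_of_two(total) // GIB


# Both 64-bit prefetchable BARs (128G and 32M) placed in the 64-bit hole.
both = hole64_size_gib([128 * GIB, 32 * MIB], NUM_GPUS)

# Only the 128G BAR in the 64-bit hole, assuming the 32M BAR finds room
# in 32-bit space.
big_only = hole64_size_gib([128 * GIB], NUM_GPUS)

print(f"-global q35-host.pci-hole64-size={both}G")
print(f"-global q35-host.pci-hole64-size={big_only}G")
```

The eight 128G BARs alone add up to exactly 1024G, so the extra 32M
BARs are what push the estimate to the next step, 2048G.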

Note that assigning bridges usually doesn't make a lot of sense, and
NVLink is a proprietary black box: we don't know how to virtualize it
or what the guest drivers will do with it, so you're on your own there.
We generally recommend using vGPUs for such cases so the host driver
can handle all the NVLink aspects for GPU peer-to-peer.  Thanks,

Alex