[PATCH v2 3/3] PPC64 support for NVIDIA V100 GPU with NVLink2 passthrough

Daniel Henrique Barboza <danielhb413@xxxxxxxxx> · Sun, 3 Mar 2019 10:23:14 -0300

The NVIDIA V100 GPU has onboard RAM that is mapped into host memory
and is accessible as normal RAM via an NVLink2 bus. When the GPU is
passed through to a guest, QEMU puts the NVIDIA RAM window in a
non-contiguous area, above the PCI MMIO area that starts at 32TiB.
This means that the NVIDIA RAM window starts at 64TiB and goes all the
way to 128TiB.

As a result, the guest might request a 64-bit window, for each PCI
Host Bridge, that goes all the way to 128TiB. However, the NVIDIA RAM
window isn't counted as regular RAM, thus this window is considered
only for the allocation of the Translation Control Entry (TCE) table.

This memory layout differs from the existing VFIO case, requiring its
own formula. This patch changes the PPC64 code of
qemuDomainGetMemLockLimitBytes to:

- detect if a VFIO PCI device is using NVLink2 capabilities. This is
done by using the device tree inspection mechanisms that were
implemented in the previous patch;

- if any device is an NVIDIA GPU using an NVLink2 bus, passthroughLimit
is calculated in a different way to account for the extra memory
the TCE table can allocate. The 64TiB..128TiB window is more than
enough to fit all possible GPUs, thus the memLimit is the
same regardless of passing through one or multiple V100 GPUs.

Signed-off-by: Daniel Henrique Barboza <danielhb413@xxxxxxxxx>
---
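Note for reviewers: the arithmetic behind the two passthroughLimit
formulas can be checked in isolation with the sketch below. It is only
an illustration, not part of the patch. The function names are
hypothetical, and it assumes all values are in KiB, which is what the
8192 (8 MiB) constant in qemuDomainGetMemLockLimitBytes implies.

```c
/* Illustrative sketch only; not part of the patch. All values are in
 * KiB (assumption, see above). The function names are hypothetical. */

/* NVLink2 case: guest RAM, plus a TCE table sized as 1/512 of a
 * 128TiB DMA window per PHB, plus 8 MiB of overhead. */
static unsigned long long
sketchNVLink2PassthroughLimit(unsigned long long maxMemoryKiB,
                              unsigned long long nPHBs)
{
    const unsigned long long windowKiB = 128 * (1ULL << 30); /* 128TiB */

    return maxMemoryKiB + windowKiB / 512 * nPHBs + 8192;
}

/* Plain VFIO case: the larger of 2 GiB per PHB and
 * memory + memory/512 per PHB + 8 MiB. */
static unsigned long long
sketchVFIOPassthroughLimit(unsigned long long memoryKiB,
                           unsigned long long nPHBs)
{
    unsigned long long floorKiB = 2ULL * 1024 * 1024 * nPHBs;
    unsigned long long mappedKiB = memoryKiB + memoryKiB / 512 * nPHBs + 8192;

    return mappedKiB > floorKiB ? mappedKiB : floorKiB;
}
```

For a 4 GiB guest (4194304 KiB) with one PHB, the NVLink2 formula
yields 272637952 KiB (roughly 260 GiB), while the plain VFIO formula
yields 4210688 KiB. The difference is the TCE table for the 128TiB
window, which does not depend on the number of GPUs.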
 src/qemu/qemu_domain.c | 44 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 41 insertions(+), 3 deletions(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 76e1e4b161..56b45fcfb7 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -10556,7 +10556,9 @@ qemuDomainGetMemLockLimitBytes(virDomainDefPtr def)
         unsigned long long baseLimit;
         unsigned long long passthroughLimit = 0;
         size_t nPCIHostBridges = 0;
-        bool usesVFIO = false;
+        virPCIDeviceAddressPtr pciAddr;
+        char *pciAddrStr = NULL;
+        bool usesVFIO = false, nvlink2Capable = false;
 
         for (i = 0; i < def->ncontrollers; i++) {
             virDomainControllerDefPtr cont = def->controllers[i];
@@ -10573,8 +10575,18 @@ qemuDomainGetMemLockLimitBytes(virDomainDefPtr def)
             if (dev->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS &&
                 dev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI &&
                 dev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) {
+
                 usesVFIO = true;
-                break;
+
+                pciAddr = &dev->source.subsys.u.pci.addr;
+                if (virPCIDeviceAddressIsValid(pciAddr, false)) {
+                    pciAddrStr = virPCIDeviceAddressAsString(pciAddr);
+                    nvlink2Capable = device_is_nvlink2_capable(pciAddrStr);
+                    /* virPCIDeviceAddressAsString() allocates; free it */
+                    VIR_FREE(pciAddrStr);
+                    if (nvlink2Capable)
+                        break;
+                }
             }
         }
 
@@ -10601,6 +10613,32 @@ qemuDomainGetMemLockLimitBytes(virDomainDefPtr def)
                     4096 * nPCIHostBridges +
                     8192;
 
+        /* NVLink2 support in QEMU is a special case of the passthrough
+         * mechanics explained in the usesVFIO case below. The GPU RAM
+         * is placed with a gap after maxMemory. The current QEMU
+         * implementation puts the NVIDIA RAM above the PCI MMIO, which
+         * starts at 32TiB and is the MMIO reserved for the guest main RAM.
+         *
+         * This window ends at 64TiB, and this is where the GPUs are being
+         * placed. The next available window boundary is at 128TiB, and
+         * 64TiB..128TiB will fit all possible NVIDIA GPUs.
+         *
+         * The same assumption as the most common case applies here:
+         * the guest will request a 64-bit DMA window, per PHB, that is
+         * big enough to map all its RAM, which is now at 128TiB due
+         * to the GPUs.
+         *
+         * Note that the NVIDIA RAM window must be accounted for in the TCE
+         * table size, but *not* for the main RAM (maxMemory). This gives
+         * us the following passthroughLimit for the NVLink2 case:
+         *
+         * passthroughLimit = maxMemory +
+         *                    128TiB/512KiB * #PHBs + 8 MiB */
+        if (nvlink2Capable)
+            passthroughLimit = maxMemory +
+                               128 * (1ULL<<30) / 512 * nPCIHostBridges +
+                               8192;
+
         /* passthroughLimit := max( 2 GiB * #PHBs,                       (c)
          *                          memory                               (d)
          *                          + memory * 1/512 * #PHBs + 8 MiB )   (e)
@@ -10620,7 +10658,7 @@ qemuDomainGetMemLockLimitBytes(virDomainDefPtr def)
          * kiB pages, less still if the guest is mapped with hugepages (unlike
          * the default 32-bit DMA window, DDW windows can use large IOMMU
          * pages). 8 MiB is for second and further level overheads, like (b) */
-        if (usesVFIO)
+        else if (usesVFIO)
             passthroughLimit = MAX(2 * 1024 * 1024 * nPCIHostBridges,
                                    memory +
                                    memory / 512 * nPCIHostBridges + 8192);
-- 
2.20.1

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list