On Tue, Mar 12, 2019 at 18:55:50 -0300, Daniel Henrique Barboza wrote:
> The NVIDIA V100 GPU has an onboard RAM that is mapped into the
> host memory and accessible as normal RAM via an NVLink2 bridge. When
> passed through in a guest, QEMU puts the NVIDIA RAM window in a
> non-contiguous area, above the PCI MMIO area that starts at 32TiB.
> This means that the NVIDIA RAM window starts at 64TiB and goes all the
> way to 128TiB.
>
> This means that the guest might request a 64-bit window, for each PCI
> Host Bridge, that goes all the way to 128TiB. However, the NVIDIA RAM
> window isn't counted as regular RAM, thus this window is considered
> only for the allocation of the Translation and Control Entry (TCE).
>
> This memory layout differs from the existing VFIO case, requiring its
> own formula. This patch changes the PPC64 code of
> @qemuDomainGetMemLockLimitBytes to:
>
> - detect if we have an NVLink2 bridge being passed through to the
> guest. This is done by using the @ppc64VFIODeviceIsNV2Bridge function
> added in the previous patch. The existence of the NVLink2 bridge in
> the guest means that we are dealing with the NVLink2 memory layout;
>
> - if an IBM NVLink2 bridge exists, passthroughLimit is calculated in a
> different way to account for the extra memory the TCE table can alloc.
> The 64TiB..128TiB window is more than enough to fit all possible
> GPUs, thus the memLimit is the same regardless of passing through 1 or
> multiple V100 GPUs.
>
> Signed-off-by: Daniel Henrique Barboza <danielhb413@xxxxxxxxx>
> ---
>  src/qemu/qemu_domain.c | 42 ++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 40 insertions(+), 2 deletions(-)
>
> diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
> index dcc92d253c..6d1a69491d 100644
> --- a/src/qemu/qemu_domain.c
> +++ b/src/qemu/qemu_domain.c
> @@ -10443,7 +10443,10 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
>      unsigned long long maxMemory = 0;
>      unsigned long long passthroughLimit = 0;
>      size_t i, nPCIHostBridges = 0;
> +    virPCIDeviceAddressPtr pciAddr;
> +    char *pciAddrStr = NULL;
>      bool usesVFIO = false;
> +    bool nvlink2Capable = false;
>
>      for (i = 0; i < def->ncontrollers; i++) {
>          virDomainControllerDefPtr cont = def->controllers[i];
> @@ -10461,7 +10464,16 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
>              dev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI &&
>              dev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) {
>              usesVFIO = true;
> -            break;
> +
> +            pciAddr = &dev->source.subsys.u.pci.addr;
> +            if (virPCIDeviceAddressIsValid(pciAddr, false)) {
> +                pciAddrStr = virPCIDeviceAddressAsString(pciAddr);

Again this leaks the PCI address string on every iteration and on exit
from this function. A rough sketch of what I have in mind is at the end
of this mail.

> +                if (ppc64VFIODeviceIsNV2Bridge(pciAddrStr)) {
> +                    nvlink2Capable = true;
> +                    break;
> +                }
> +            }
> +
>          }
>      }
>
> @@ -10488,6 +10500,32 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
>                  4096 * nPCIHostBridges +
>                  8192;
>
> +    /* NVLink2 support in QEMU is a special case of the passthrough
> +     * mechanics explained in the usesVFIO case below. The GPU RAM
> +     * is placed with a gap after maxMemory. The current QEMU
> +     * implementation puts the NVIDIA RAM above the PCI MMIO, which
> +     * starts at 32TiB and is the MMIO reserved for the guest main RAM.
> +     *
> +     * This window ends at 64TiB, and this is where the GPUs are being
> +     * placed. The next available window size is at 128TiB, and
> +     * 64TiB..128TiB will fit all possible NVIDIA GPUs.
> +     *
> +     * The same assumption as the most common case applies here:
> +     * the guest will request a 64-bit DMA window, per PHB, that is
> +     * big enough to map all its RAM, which is now at 128TiB due
> +     * to the GPUs.
> +     *
> +     * Note that the NVIDIA RAM window must be accounted for the TCE
> +     * table size, but *not* for the main RAM (maxMemory). This gives
> +     * us the following passthroughLimit for the NVLink2 case:

Citation needed. Please link a source for these claims. We have some
sources for claims on x86_64 even if they are not exactly scientific.

> +     *
> +     * passthroughLimit = maxMemory +
> +     *                    128TiB/512KiB * #PHBs + 8 MiB */
> +    if (nvlink2Capable)

Please add curly braces to this condition as it's multi-line and also
has a big comment inside of it. A braced version, together with my
reading of the units in the formula, is at the end of this mail.

> +        passthroughLimit = maxMemory +
> +                           128 * (1ULL<<30) / 512 * nPCIHostBridges +
> +                           8192;

I don't quite understand why this formula uses maxMemory while the
VFIO case uses just 'memory'.

> +
>      /* passthroughLimit := max( 2 GiB * #PHBs,                       (c)
>       *                          memory                               (d)
>       *                          + memory * 1/512 * #PHBs + 8 MiB )   (e)
> @@ -10507,7 +10545,7 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
>       * kiB pages, less still if the guest is mapped with hugepages (unlike
>       * the default 32-bit DMA window, DDW windows can use large IOMMU
>       * pages). 8 MiB is for second and further level overheads, like (b) */
> -    if (usesVFIO)
> +    else if (usesVFIO)

So can't there be a case when an NVLink2 device is present but also
e.g. VFIO network cards?

>          passthroughLimit = MAX(2 * 1024 * 1024 * nPCIHostBridges,
>                                 memory +
>                                 memory / 512 * nPCIHostBridges + 8192);

Also add curly braces here while you are at it.

> -- 
> 2.20.1
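
For the leak, something along these lines is what I had in mind. This is
only an untested sketch reusing the code from your hunk, with the string
freed via VIR_FREE both between iterations and once after the loop:

            usesVFIO = true;

            pciAddr = &dev->source.subsys.u.pci.addr;
            if (virPCIDeviceAddressIsValid(pciAddr, false)) {
                /* drop the string from the previous iteration before
                 * overwriting the pointer */
                VIR_FREE(pciAddrStr);
                pciAddrStr = virPCIDeviceAddressAsString(pciAddr);
                if (ppc64VFIODeviceIsNV2Bridge(pciAddrStr)) {
                    nvlink2Capable = true;
                    break;
                }
            }
        }
    }

    /* free the string from the last iteration, covering both the
     * break and the fall-through exit from the loop */
    VIR_FREE(pciAddrStr);

Alternatively, a VIR_AUTOFREE string declared inside the loop would avoid
the manual calls; either way works for me.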
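
And for the braces, plus how I read the units in the new constant
(assuming, as the existing "+ 8192" == 8 MiB term suggests, that the
values in this function are accumulated in KiB; please correct me if
that is wrong), the NVLink2 branch would look roughly like:

    if (nvlink2Capable) {
        /* All values are in KiB: 128 * (1ULL << 30) KiB == 128TiB, the
         * top of the guest memory map once the GPU RAM window is
         * included, and one 8-byte TCE per 4 KiB IOMMU page gives the
         * 1/512 ratio, i.e. 256 GiB of TCE tables per PHB, plus 8 MiB
         * for second and further level overheads, as in the existing
         * formula. */
        passthroughLimit = maxMemory +
                           128 * (1ULL << 30) / 512 * nPCIHostBridges +
                           8192;
    }

If that reading is right, spelling the units out in the comment would
help future readers.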