On Tue, Mar 12, 2019 at 18:55:50 -0300, Daniel Henrique Barboza wrote:
> The NVIDIA V100 GPU has an onboard RAM that is mapped into the
> host memory and accessible as normal RAM via an NVLink2 bridge. When
> passed through in a guest, QEMU puts the NVIDIA RAM window in a
> non-contiguous area, above the PCI MMIO area that starts at 32TiB.
> This means that the NVIDIA RAM window starts at 64TiB and goes all the
> way to 128TiB.
>
> This means that the guest might request a 64-bit window, for each PCI
> Host Bridge, that goes all the way to 128TiB. However, the NVIDIA RAM
> window isn't counted as regular RAM, thus this window is considered
> only for the allocation of the Translation and Control Entry (TCE).
>
> This memory layout differs from the existing VFIO case, requiring its
> own formula. This patch changes the PPC64 code of
> @qemuDomainGetMemLockLimitBytes to:
>
> - detect if we have an NVLink2 bridge being passed through to the
> guest. This is done by using the @ppc64VFIODeviceIsNV2Bridge function
> added in the previous patch. The existence of the NVLink2 bridge in
> the guest means that we are dealing with the NVLink2 memory layout;
>
> - if an IBM NVLink2 bridge exists, passthroughLimit is calculated in a
> different way to account for the extra memory the TCE table can alloc.
> The 64TiB..128TiB window is more than enough to fit all possible
> GPUs, thus the memLimit is the same regardless of passing through 1 or
> multiple V100 GPUs.
>
> Signed-off-by: Daniel Henrique Barboza <danielhb413@xxxxxxxxx>
> ---
>  src/qemu/qemu_domain.c | 42 ++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 40 insertions(+), 2 deletions(-)
>
> diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
> index dcc92d253c..6d1a69491d 100644
> --- a/src/qemu/qemu_domain.c
> +++ b/src/qemu/qemu_domain.c
> @@ -10443,7 +10443,10 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
>      unsigned long long maxMemory = 0;
>      unsigned long long passthroughLimit = 0;
>      size_t i, nPCIHostBridges = 0;
> +    virPCIDeviceAddressPtr pciAddr;
> +    char *pciAddrStr = NULL;
>      bool usesVFIO = false;
> +    bool nvlink2Capable = false;
>
>      for (i = 0; i < def->ncontrollers; i++) {
>          virDomainControllerDefPtr cont = def->controllers[i];
> @@ -10461,7 +10464,16 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
>              dev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI &&
>              dev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) {
>              usesVFIO = true;
> -            break;
> +
> +            pciAddr = &dev->source.subsys.u.pci.addr;
> +            if (virPCIDeviceAddressIsValid(pciAddr, false)) {
> +                pciAddrStr = virPCIDeviceAddressAsString(pciAddr);

Again this leaks the PCI address string on every iteration and on exit
from this function. A rough sketch of what I have in mind is at the end
of this mail.

> +                if (ppc64VFIODeviceIsNV2Bridge(pciAddrStr)) {
> +                    nvlink2Capable = true;
> +                    break;
> +                }
> +            }
> +
>          }
>      }
>
> @@ -10488,6 +10500,32 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
>                  4096 * nPCIHostBridges +
>                  8192;
>
> +    /* NVLink2 support in QEMU is a special case of the passthrough
> +     * mechanics explained in the usesVFIO case below. The GPU RAM
> +     * is placed with a gap after maxMemory. The current QEMU
> +     * implementation puts the NVIDIA RAM above the PCI MMIO, which
> +     * starts at 32TiB and is the MMIO reserved for the guest main RAM.
> +     *
> +     * This window ends at 64TiB, and this is where the GPUs are being
> +     * placed. The next available window size is at 128TiB, and
> +     * 64TiB..128TiB will fit all possible NVIDIA GPUs.
> +     *
> +     * The same assumption as the most common case applies here:
> +     * the guest will request a 64-bit DMA window, per PHB, that is
> +     * big enough to map all its RAM, which is now at 128TiB due
> +     * to the GPUs.
> +     *
> +     * Note that the NVIDIA RAM window must be accounted for the TCE
> +     * table size, but *not* for the main RAM (maxMemory). This gives
> +     * us the following passthroughLimit for the NVLink2 case:

Citation needed. Please link a source for these claims. We have some
sources for claims on x86_64 even if they are not exactly scientific.

> +     *
> +     * passthroughLimit = maxMemory +
> +     *                    128TiB/512KiB * #PHBs + 8 MiB */
> +    if (nvlink2Capable)

Please add curly braces to this condition as it's multi-line and also
has a big comment inside of it. A braced version, together with my
reading of the units in the formula, is at the end of this mail.

> +        passthroughLimit = maxMemory +
> +                           128 * (1ULL<<30) / 512 * nPCIHostBridges +
> +                           8192;

I don't quite understand why this formula uses maxMemory while the
VFIO case uses just 'memory'.

> +
>      /* passthroughLimit := max( 2 GiB * #PHBs,                       (c)
>       *                          memory                               (d)
>       *                          + memory * 1/512 * #PHBs + 8 MiB )   (e)
> @@ -10507,7 +10545,7 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
>       * kiB pages, less still if the guest is mapped with hugepages (unlike
>       * the default 32-bit DMA window, DDW windows can use large IOMMU
>       * pages). 8 MiB is for second and further level overheads, like (b) */
> -    if (usesVFIO)
> +    else if (usesVFIO)

So can't there be a case when an NVLink2 device is present but also
e.g. VFIO network cards?

>          passthroughLimit = MAX(2 * 1024 * 1024 * nPCIHostBridges,
>                                 memory +
>                                 memory / 512 * nPCIHostBridges + 8192);

Also add curly braces here while you are at it.

> -- 
> 2.20.1
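
For the leak, something along these lines is what I had in mind. This is
only an untested sketch reusing the code from your hunk, with the string
freed via VIR_FREE both between iterations and once after the loop:

            usesVFIO = true;

            pciAddr = &dev->source.subsys.u.pci.addr;
            if (virPCIDeviceAddressIsValid(pciAddr, false)) {
                /* drop the string from the previous iteration before
                 * overwriting the pointer */
                VIR_FREE(pciAddrStr);
                pciAddrStr = virPCIDeviceAddressAsString(pciAddr);
                if (ppc64VFIODeviceIsNV2Bridge(pciAddrStr)) {
                    nvlink2Capable = true;
                    break;
                }
            }
        }
    }

    /* free the string from the last iteration, covering both the
     * break and the fall-through exit from the loop */
    VIR_FREE(pciAddrStr);

Alternatively, a VIR_AUTOFREE string declared inside the loop would avoid
the manual calls; either way works for me.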
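
And for the braces, plus how I read the units in the new constant
(assuming, as the existing "+ 8192" == 8 MiB term suggests, that the
values in this function are accumulated in KiB; please correct me if
that is wrong), the NVLink2 branch would look roughly like:

    if (nvlink2Capable) {
        /* All values are in KiB: 128 * (1ULL << 30) KiB == 128TiB, the
         * top of the guest memory map once the GPU RAM window is
         * included, and one 8-byte TCE per 4 KiB IOMMU page gives the
         * 1/512 ratio, i.e. 256 GiB of TCE tables per PHB, plus 8 MiB
         * for second and further level overheads, as in the existing
         * formula. */
        passthroughLimit = maxMemory +
                           128 * (1ULL << 30) / 512 * nPCIHostBridges +
                           8192;
    }

If that reading is right, spelling the units out in the comment would
help future readers.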