On Thu, Apr 04, 2019 at 10:40:39AM -0300, Daniel Henrique Barboza wrote:
> The NVIDIA V100 GPU has an onboard RAM that is mapped into the
> host memory and accessible as normal RAM via an NVLink2 bridge. When
> passed through in a guest, QEMU puts the NVIDIA RAM window in a
> non-contiguous area, above the PCI MMIO area that starts at 32 TiB.
> This means that the NVIDIA RAM window starts at 64 TiB and goes all
> the way to 128 TiB.
>
> This means that the guest might request a 64-bit window, for each PCI
> Host Bridge, that goes all the way to 128 TiB. However, the NVIDIA RAM
> window isn't counted as regular RAM, thus this window is considered
> only for the allocation of the Translation and Control Entry (TCE).
> For more information about how NVLink2 support works in QEMU,
> refer to the accepted implementation [1].
>
> This memory layout differs from the existing VFIO case, requiring its
> own formula. This patch changes the PPC64 code of
> @qemuDomainGetMemLockLimitBytes to:
>
> - detect if we have an NVLink2 bridge being passed through to the
>   guest. This is done by using the @ppc64VFIODeviceIsNV2Bridge
>   function added in the previous patch. The existence of the NVLink2
>   bridge in the guest means that we are dealing with the NVLink2
>   memory layout;
>
> - if an IBM NVLink2 bridge exists, passthroughLimit is calculated in a
>   different way to account for the extra memory the TCE table can
>   allocate. The 64 TiB..128 TiB window is more than enough to fit all
>   possible GPUs, thus the memLimit is the same regardless of passing
>   through 1 or multiple V100 GPUs.
>
> [1] https://lists.gnu.org/archive/html/qemu-devel/2019-03/msg03700.html

For further explanation, I'll also add Alexey's responses on the libvirt
list:
https://www.redhat.com/archives/libvir-list/2019-March/msg00660.html
https://www.redhat.com/archives/libvir-list/2019-April/msg00527.html

...
> +     *  passthroughLimit = maxMemory +
> +     *                     128TiB/512KiB * #PHBs + 8 MiB */
> +    if (nvlink2Capable) {
> +        passthroughLimit = maxMemory +
> +                           128 * (1ULL<<30) / 512 * nPCIHostBridges +
> +                           8192;
> +    } else if (usesVFIO) {
> +        /* For regular (non-NVLink1 present) VFIO passthrough, the value

Shouldn't ^this be "non-NVLink2 present", since the limits are unchanged
except you need to assign the bridges too for NVLink1?

> +         * of passthroughLimit is:
> +         *
> +         * passthroughLimit := max( 2 GiB * #PHBs,                       (c)
> +         *                          memory                               (d)
> +         *                          + memory * 1/512 * #PHBs + 8 MiB )   (e)
> +         *
> +         * (c) is the pre-DDW VFIO DMA window accounting. We're allowing 2
> +         * GiB rather than 1 GiB
> +         *
> +         * (d) is the with-DDW (and memory pre-registration and related
> +         * features) DMA window accounting - assuming that we only account
> +         * RAM once, even if mapped to multiple PHBs
> +         *
> +         * (e) is the with-DDW userspace view and overhead for the 64-bit
> +         * DMA window. This is based a bit on expected guest behaviour, but
> +         * there really isn't a way to completely avoid that. We assume the
> +         * guest requests a 64-bit DMA window (per PHB) just big enough to
> +         * map all its RAM. 4 kiB page size gives the 1/512; it will be
> +         * less with 64 kiB pages, less still if the guest is mapped with
> +         * hugepages (unlike the default 32-bit DMA window, DDW windows
> +         * can use large IOMMU pages). 8 MiB is for second and further level
> +         * overheads, like (b) */
>          passthroughLimit = MAX(2 * 1024 * 1024 * nPCIHostBridges,
>                                 memory +
>                                 memory / 512 * nPCIHostBridges + 8192);
> +    }
>
>      memKB = baseLimit + passthroughLimit;
>

Let me know whether I need to adjust the commentary above before pushing:

Reviewed-by: Erik Skultety <eskultet@xxxxxxxxxx>

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list