> From: ankita@xxxxxxxxxx <ankita@xxxxxxxxxx> > Sent: Friday, February 16, 2024 11:01 AM > > From: Ankit Agrawal <ankita@xxxxxxxxxx> > > NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device > for the on-chip GPU that is the logical OS representation of the > internal proprietary chip-to-chip cache coherent interconnect. > > The device is peculiar compared to a real PCI device in that whilst > there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the > device, it is not used to access device memory once the faster > chip-to-chip interconnect is initialized (occurs at the time of host > system boot). The device memory is accessed instead using the chip-to-chip > interconnect that is exposed as a contiguous physically addressable > region on the host. This device memory aperture can be obtained from host > ACPI table using device_property_read_u64(), according to the FW > specification. Since the device memory is cache coherent with the CPU, > it can be mmap into the user VMA with a cacheable mapping using > remap_pfn_range() and used like a regular RAM. The device memory > is not added to the host kernel, but mapped directly as this reduces > memory wastage due to struct pages. > > There is also a requirement of a minimum reserved 1G uncached region > (termed as resmem) to support the Multi-Instance GPU (MIG) feature [1]. > This is to work around a HW defect. Based on [2], the requisite properties > (uncached, unaligned access) can be achieved through a VM mapping (S1) > of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide > a different non-cached property to the reserved 1G region, it needs to > be carved out from the device memory and mapped as a separate region > in Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets the > Qemu VMA page properties (pgprot) as NORMAL_NC. > > Provide a VFIO PCI variant driver that adapts the unique device memory > representation into a more standard PCI representation facing userspace. > > The variant driver exposes these two regions - the non-cached reserved > (resmem) and the cached rest of the device memory (termed as usemem) as > separate VFIO 64b BAR regions. This is divergent from the baremetal > approach, where the device memory is exposed as a device memory region. > The decision for a different approach was taken in view of the fact that > it would necessiate additional code in Qemu to discover and insert those > regions in the VM IPA, along with the additional VM ACPI DSDT changes to > communicate the device memory region IPA to the VM workloads. Moreover, > this behavior would have to be added to a variety of emulators (beyond > top of tree Qemu) out there desiring grace hopper support. > > Since the device implements 64-bit BAR0, the VFIO PCI variant driver > maps the uncached carved out region to the next available PCI BAR (i.e. > comprising of region 2 and 3). The cached device memory aperture is > assigned BAR region 4 and 5. Qemu will then naturally generate a PCI > device in the VM with the uncached aperture reported as BAR2 region, > the cacheable as BAR4. The variant driver provides emulation for these > fake BARs' PCI config space offset registers. > > The hardware ensures that the system does not crash when the memory > is accessed with the memory enable turned off. It synthesis ~0 reads > and dropped writes on such access. So there is no need to support the > disablement/enablement of BAR through PCI_COMMAND config space > register. > > The memory layout on the host looks like the following: > devmem (memlength) > |--------------------------------------------------| > |-------------cached------------------------|--NC--| > | | > usemem.memphys resmem.memphys > > PCI BARs need to be aligned to the power-of-2, but the actual memory on the > device may not. A read or write access to the physical address from the > last device PFN up to the next power-of-2 aligned physical address > results in reading ~0 and dropped writes. Note that the GPU device > driver [6] is capable of knowing the exact device memory size through > separate means. The device memory size is primarily kept in the system > ACPI tables for use by the VFIO PCI variant module. > > Note that the usemem memory is added by the VM Nvidia device driver [5] > to the VM kernel as memblocks. Hence make the usable memory size > memblock > (MEMBLK_SIZE) aligned. This is a hardwired ABI value between the GPU FW > and > VFIO driver. The VM device driver make use of the same value for its > calculation to determine USEMEM size. > > Currently there is no provision in KVM for a S2 mapping with > MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same [3]. > As previously mentioned, resmem is mapped pgprot_writecombine(), that > sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Using the > proposed changes in [3] and [4], KVM marks the region with > MemAttr[2:0]=0b101 in S2. > > If the device memory properties are not present, the driver registers the > vfio-pci-core function pointers. Since there are no ACPI memory properties > generated for the VM, the variant driver inside the VM will only use > the vfio-pci-core ops and hence try to map the BARs as non cached. This > is not a problem as the CPUs have FWB enabled which blocks the VM > mapping's ability to override the cacheability set by the host mapping. > > This goes along with a qemu series [6] to provides the necessary > implementation of the Grace Hopper Superchip firmware specification so > that the guest operating system can see the correct ACPI modeling for > the coherent GPU device. Verified with the CUDA workload in the VM. > > [1] https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ > [2] section D8.5.5 of > https://developer.arm.com/documentation/ddi0487/latest/ > [3] https://lore.kernel.org/all/20240211174705.31992-1-ankita@xxxxxxxxxx/ > [4] https://lore.kernel.org/all/20230907181459.18145-2-ankita@xxxxxxxxxx/ > [5] https://github.com/NVIDIA/open-gpu-kernel-modules > [6] https://lore.kernel.org/all/20231203060245.31593-1-ankita@xxxxxxxxxx/ > > Signed-off-by: Aniket Agashe <aniketa@xxxxxxxxxx> > Signed-off-by: Ankit Agrawal <ankita@xxxxxxxxxx> Reviewed-by: Kevin Tian <kevin.tian@xxxxxxxxx>