From: Ankit Agrawal <ankita@xxxxxxxxxx>

NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
for the on-chip GPU that is the logical OS representation of the
internal proprietary chip-to-chip cache coherent interconnect.

The device is peculiar compared to a real PCI device in that whilst
there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the
device, it is not used to access device memory once the faster
chip-to-chip interconnect is initialized (occurs at the time of host
system boot). The device memory is accessed instead using the
chip-to-chip interconnect that is exposed as a contiguous physically
addressable region on the host. Since the device memory is cache
coherent with the CPU, it can be mmapped into the user VMA with a
cacheable mapping and used like regular RAM. The device memory is not
added to the host kernel, but mapped directly, as this reduces memory
wastage due to struct pages.

There is also a requirement of a reserved 1G uncached region (termed
as resmem) to support the Multi-Instance GPU (MIG) feature [1]. This
is to work around a HW defect. Based on [2], the requisite properties
(uncached, unaligned access) can be achieved through a VM mapping (S1)
of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide
a different non-cached property to the reserved 1G region, it needs to
be carved out from the device memory and mapped as a separate region
in Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets the
Qemu VMA page properties (pgprot) as NORMAL_NC.

Provide a VFIO PCI variant driver that adapts the unique device memory
representation into a more standard PCI representation facing
userspace. The variant driver exposes these two regions - the
non-cached reserved (resmem) and the cached rest of the device memory
(termed as usemem) - as separate VFIO 64b BAR regions. This is
divergent from the baremetal approach, where the device memory is
exposed as a device memory region.
The decision for a different approach was taken in view of the fact
that it would necessitate additional code in Qemu to discover and
insert those regions in the VM IPA, along with the additional VM ACPI
DSDT changes to communicate the device memory region IPA to the VM
workloads. Moreover, this behavior would have to be added to a variety
of emulators (beyond top of tree Qemu) out there desiring Grace Hopper
support.

Since the device implements 64-bit BAR0, the VFIO PCI variant driver
maps the uncached carved out region to the next available PCI BAR
(i.e. comprising regions 2 and 3). The cached device memory aperture
is assigned BAR regions 4 and 5. Qemu will then naturally generate a
PCI device in the VM with the uncached aperture reported as the BAR2
region and the cacheable aperture as BAR4.

The variant driver provides emulation for these fake BARs' PCI config
space offset registers.

The hardware ensures that the system does not crash when the memory
is accessed with the memory enable turned off. It synthesizes ~0 reads
and dropped writes on such access. So there is no need to support the
disablement/enablement of the BARs through the PCI_COMMAND config
space register.

The memory layout on the host looks like the following:

               devmem (memlength)
|--------------------------------------------------|
|-------------cached------------------------|--NC--|
|                                           |
usemem.phys/memphys                         resmem.phys

PCI BARs need to be power-of-2 aligned, but the actual memory on the
device may not be. A read or write access to the physical address
range from the last device PFN up to the next power-of-2 aligned
physical address results in reading ~0 and dropped writes. Note that
the GPU device driver [5] is capable of knowing the exact device
memory size through separate means. The device memory size is
primarily kept in the system ACPI tables for use by the VFIO PCI
variant module.

Note that the usemem memory is added by the VM Nvidia device driver
[5] to the VM kernel as memblocks.
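The layout and the power-of-2 rule described above can be sketched in
plain user-space C. This is only an illustration under stated
assumptions, not the driver code: struct devmem_layout, split_devmem(),
roundup_pow2() and fake_bar_read() are hypothetical names, resmem is
assumed to be the last 1G of devmem as in the diagram, and memblock
alignment of usemem is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Reserved 1G non-cached region (resmem), as stated in the text. */
#define RESMEM_SIZE (1ULL << 30)

/* Hypothetical container; field names are illustrative only. */
struct devmem_layout {
	uint64_t usemem_phys, usemem_len;   /* cached part of device memory */
	uint64_t resmem_phys, resmem_len;   /* non-cached carve-out at the end */
	uint64_t usemem_bar_size;           /* fake BAR sizes, power of 2 */
	uint64_t resmem_bar_size;
};

/* Smallest power of 2 that is >= size. */
static uint64_t roundup_pow2(uint64_t size)
{
	uint64_t p = 1;

	assert(size != 0 && size <= (1ULL << 63));
	while (p < size)
		p <<= 1;
	return p;
}

/* Carve resmem out of the end of devmem; 0 on success, -1 on error. */
static int split_devmem(uint64_t memphys, uint64_t memlength,
			struct devmem_layout *l)
{
	if (memlength <= RESMEM_SIZE)
		return -1;	/* nothing would be left for usemem */

	l->usemem_phys = memphys;
	l->usemem_len = memlength - RESMEM_SIZE;
	l->resmem_phys = memphys + l->usemem_len;
	l->resmem_len = RESMEM_SIZE;
	l->usemem_bar_size = roundup_pow2(l->usemem_len);
	l->resmem_bar_size = roundup_pow2(l->resmem_len);
	return 0;
}

/*
 * Emulated read from a fake BAR: bytes backed by device memory are
 * copied, bytes between mem_len and bar_size read as ~0 (0xff), and an
 * offset at or beyond bar_size is rejected.
 * Returns the number of bytes read, or -1 on an invalid offset.
 */
static long fake_bar_read(const uint8_t *mem, uint64_t mem_len,
			  uint64_t bar_size, uint64_t off,
			  uint8_t *buf, uint64_t count)
{
	uint64_t i;

	if (off >= bar_size)
		return -1;
	if (count > bar_size - off)
		count = bar_size - off;	/* clamp to the BAR size */

	for (i = 0; i < count; i++)
		buf[i] = (off + i < mem_len) ? mem[off + i] : 0xff;
	return (long)count;
}
```

With a hypothetical 96G devmem, split_devmem() gives a 95G usemem
behind a 128G fake BAR and a 1G resmem behind a 1G fake BAR; offsets
between 95G and 128G in the usemem BAR would read back as ~0.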
Since usemem is added to the VM kernel as memblocks, make the usable
memory size memblock aligned.

Currently there is no provision in KVM for a S2 mapping with
MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same
[3]. As previously mentioned, resmem is mapped with
pgprot_writecombine(), which sets the Qemu VMA page properties
(pgprot) as NORMAL_NC. Using the proposed changes in [4] and [3], KVM
marks the region with MemAttr[2:0]=0b101 in S2.

If the device memory properties are not present in the host ACPI
table, the driver registers the vfio-pci-core function pointers.

This goes along with a qemu series [6] that provides the necessary
implementation of the Grace Hopper Superchip firmware specification so
that the guest operating system can see the correct ACPI modeling for
the coherent GPU device. Verified with a CUDA workload in the VM.

[1] https://www.nvidia.com/en-in/technologies/multi-instance-gpu/
[2] section D8.5.5 of https://developer.arm.com/documentation/ddi0487/latest/
[3] https://lore.kernel.org/all/20231205033015.10044-1-ankita@xxxxxxxxxx/
[4] https://lore.kernel.org/all/20230907181459.18145-2-ankita@xxxxxxxxxx/
[5] https://github.com/NVIDIA/open-gpu-kernel-modules
[6] https://lore.kernel.org/all/20231203060245.31593-1-ankita@xxxxxxxxxx/

Applied over v6.8-rc2.

Signed-off-by: Aniket Agashe <aniketa@xxxxxxxxxx>
Signed-off-by: Ankit Agrawal <ankita@xxxxxxxxxx>
---
Link for variant driver v16:
https://lore.kernel.org/all/20240115211516.635852-1-ankita@xxxxxxxxxx/

v16 -> v17
- Moved, renamed and exported range_intersect_range() per suggestion
  from Rahul Rameshbabu.
- Updated license from GPLv2 to GPL.
- Fixed S-O-B mistakes.
- Removed nvgrace_gpu_vfio_pci.h based on Alex Williamson's suggestion.
- Refactored [read]write_config_emu based on Alex's suggestion.
- Added fallback to vfio-pci-core function pointers in case of absence
  of memory properties in the host ACPI table as per Alex's suggestion.
- Used anonymous union to represent the mapped device memory.
- Fixed code nits and rephrased comments.
- Rebased to v6.8-rc2.

v15 -> v16
- Added the missing header file causing build failure in v15.
- Moved the range_intersect_range() function to a separate patch.
- Exported do_io_rw as GPL and moved it to the vfio-pci-core file.
- Added helper function to mask with BAR size and add flag while
  returning a read on the fake BARs' PCI config register.
- Removed the PCI command disable.
- Removed nvgrace_gpu_vfio_pci_fake_bar_mem_region().
- Fixed miscellaneous nits.

v14 -> v15
- Added case to handle VFIO_DEVICE_IOEVENTFD to return -EIO as it is
  not required on the device.
- Updated the BAR config space handling code to closely resemble the
  code by Yishai Hadas (using range_intersect_range) in
  https://lore.kernel.org/all/20231207102820.74820-10-yishaih@xxxxxxxxxx
- Changed the BAR PCI config register from union to u64.
- Adapted the code to disable the BAR when it is disabled through
  PCI_COMMAND.
- Exported and reused do_io_rw to do MMIO accesses.
- Added a new header file to keep the newly declared structures.
- Miscellaneous code fixes suggested by Alex Williamson in v14.

v13 -> v14
- Merged the changes for the second BAR implementation for MIG support
  on the device driver.
  https://lore.kernel.org/all/20231115080751.4558-1-ankita@xxxxxxxxxx/
- Added the missing implementation of sub-word access to the fake
  BARs' PCI config space. Implemented the access algorithm suggested
  by Alex Williamson in the comments (Thanks!)
- Added support for BAR accesses on the reserved memory with the Qemu
  device param x-no-mmap=on.
- Handled endianness in the PCI config space access.
- Updated the git commit message.

v12 -> v13
- Added emulation for the PCI config space BAR offset register for the
  fake BAR.
- Updated the commit message with more details on the BAR offset
  emulation.

v11 -> v12
- More details in commit message on device memory size.

v10 -> v11
- Removed sysfs attribute to expose the CPU coherent memory feature.
- Addressed review comments.

v9 -> v10
- Added new sysfs attribute to expose the CPU coherent memory feature.

v8 -> v9
- Minor code adjustment suggested in v8.

v7 -> v8
- Various field names updated.
- Added a new function to handle the VFIO_DEVICE_GET_REGION_INFO ioctl.
- Locking protection for memremap to BAR region and other changes
  recommended in v7.
- Added code to fail if the devmem size advertised in the system DSDT
  is 0.

v6 -> v7
- Handled out-of-bound and overflow conditions at various places to
  validate input offset and length.
- Added code to return EINVAL for offset beyond region size.

v5 -> v6
- Added the code to handle BAR2 read/write using memremap to the
  device memory.

v4 -> v5
- Changed the module name from nvgpu-vfio-pci to nvgrace-gpu-vfio-pci.

v3 -> v4
- Mapping the available device memory using sparse mmap. The region
  outside the device memory is handled by read/write ops.
- Removed the fault handler added in v3.

v2 -> v3
- Added fault handler to map the region outside the physical GPU
  memory up to the next power-of-2 to a dummy PFN.
- Changed to select instead of "depends on" VFIO_PCI_CORE for all the
  vfio-pci variant drivers.
- Code cleanup based on feedback comments.
- Code implemented and tested against v6.4-rc4.

v1 -> v2
- Updated the wording of reference to BAR offset and replaced it with
  index.
- The GPU memory is exposed at the fixed BAR2_REGION_INDEX.
- Code cleanup based on feedback comments.

Ankit Agrawal (3):
  vfio/pci: rename and export do_io_rw()
  vfio/pci: rename and export range_intersect_range
  vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

 MAINTAINERS                           |   6 +
 drivers/vfio/pci/Kconfig              |   2 +
 drivers/vfio/pci/Makefile             |   2 +
 drivers/vfio/pci/nvgrace-gpu/Kconfig  |  10 +
 drivers/vfio/pci/nvgrace-gpu/Makefile |   3 +
 drivers/vfio/pci/nvgrace-gpu/main.c   | 856 ++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_config.c    |  45 ++
 drivers/vfio/pci/vfio_pci_rdwr.c      |  16 +-
 drivers/vfio/pci/virtio/main.c        |  72 +--
 include/linux/vfio_pci_core.h         |  10 +-
 10 files changed, 968 insertions(+), 54 deletions(-)
 create mode 100644 drivers/vfio/pci/nvgrace-gpu/Kconfig
 create mode 100644 drivers/vfio/pci/nvgrace-gpu/Makefile
 create mode 100644 drivers/vfio/pci/nvgrace-gpu/main.c

--
2.34.1