From: Ankit Agrawal <ankita@xxxxxxxxxx>

Grace based platforms such as Grace Hopper/Blackwell Superchips have
CPU accessible cache coherent GPU memory. The GPU device memory is
essentially a DDR memory and retains properties such as cacheability,
unaligned accesses, atomics and handling of executable faults. This
requires the device memory to be mapped as NORMAL in stage-2.

Today KVM forces the memory to either NORMAL or DEVICE_nGnRE depending
on whether the memory region is added to the kernel. The KVM code is
thus restrictive and prevents device memory that is not added to the
kernel from being marked as cacheable. The patch aims to solve this.

A cacheability check is made if VM_PFNMAP is set in the VMA flags, by
consulting the VMA pgprot value. If the pgprot mapping type is
MT_NORMAL, it is considered safe to map the memory cacheable, as the
KVM S2 will have the same Normal memory type as the VMA has in the S1
and KVM has no additional responsibility for safety.

Note that when FWB is not enabled, the kernel expects to trivially do
cache management by flushing the memory, linearly converting a kvm_pte
to a phys_addr to a KVA. The cache management thus relies on the memory
being kernel mapped. Since the GPU device memory is not kernel mapped,
exit when FWB is not supported.

The changes are heavily influenced by the discussions between
Catalin Marinas, David Hildenbrand and Jason Gunthorpe [1] on v2. Many
thanks for their valuable suggestions.

Applied over next-20250305 and tested on the Grace Hopper and
Grace Blackwell platforms by booting up a VM, loading the NVIDIA module
[2] and running nvidia-smi in the VM.

To run CUDA workloads, there is a dependency on the IOMMUFD and the
Nested Page Table patches being worked on separately by Nicolin Chen
(nicolinc@xxxxxxxxxx). NVIDIA has provided git repositories that
include all the requisite kernel [3] and QEMU [4] patches in case one
wants to try.

v2 -> v3
1. Restricted the new changes to check for cacheability to VM_PFNMAP
   based on David Hildenbrand's (david@xxxxxxxxxx) suggestion.
2. Removed the MTE checks based on Jason Gunthorpe's (jgg@xxxxxxxxxx)
   observation that it is already done earlier in
   kvm_arch_prepare_memory_region.
3. Dropped the pfn_valid() checks based on suggestions by
   Catalin Marinas (catalin.marinas@xxxxxxx).
4. Removed the code for exec fault handling as it is not needed
   anymore.

v1 -> v2
1. Removed kvm_is_device_pfn() as the determiner of device type
   memory. Using pfn_valid() instead.
2. Added handling for MTE.
3. Minor cleanup.

Link: https://lore.kernel.org/all/20241118131958.4609-1-ankita@xxxxxxxxxx/ [1]
Link: https://github.com/NVIDIA/open-gpu-kernel-modules [2]
Link: https://github.com/NVIDIA/NV-Kernels/tree/6.8_ghvirt [3]
Link: https://github.com/NVIDIA/QEMU/tree/6.8_ghvirt_iommufd_vcmdq [4]

Ankit Agrawal (1):
  KVM: arm64: Allow cacheable stage 2 mapping using VMA flags

 arch/arm64/include/asm/kvm_pgtable.h |  8 ++++++
 arch/arm64/kvm/hyp/pgtable.c         |  2 +-
 arch/arm64/kvm/mmu.c                 | 43 +++++++++++++++++++++++++++-
 3 files changed, 51 insertions(+), 2 deletions(-)

-- 
2.34.1
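
For readers who want the decision rule from the cover letter in one place,
here is a minimal user-space sketch of it. It is not the patch itself and
does not use kernel headers: every name prefixed with sketch_ or SKETCH_ is
a hypothetical stand-in, and the ordering of the checks is one reading of
the description (VM_PFNMAP first, then the stage-1 memory type, then FWB).
Anything that is not a VM_PFNMAP mapping with a Normal stage-1 type is
reported as "unchanged", meaning the pre-existing KVM policy would decide.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; these are not kernel definitions. */
#define SKETCH_VM_PFNMAP 0x1u

enum sketch_s1_type { SKETCH_MT_NORMAL, SKETCH_MT_OTHER };

enum sketch_s2_result {
	SKETCH_S2_UNCHANGED,	/* new check does not apply; existing policy decides */
	SKETCH_S2_CACHEABLE,	/* map as Normal (cacheable) in stage 2 */
	SKETCH_S2_REJECT	/* cacheable mapping wanted, but FWB is missing */
};

/* Simplified model of a VMA: only the two properties the check looks at. */
struct sketch_vma {
	unsigned int flags;		/* SKETCH_VM_PFNMAP or 0 */
	enum sketch_s1_type s1_type;	/* memory type implied by the pgprot */
};

/*
 * Assumed ordering: the cacheability check only runs for VM_PFNMAP VMAs,
 * only a Normal stage-1 type qualifies, and a qualifying mapping is
 * refused when FWB is absent because the memory is not kernel mapped and
 * so cannot be flushed through a kernel virtual address.
 */
enum sketch_s2_result sketch_s2_decide(const struct sketch_vma *vma, bool has_fwb)
{
	if (!(vma->flags & SKETCH_VM_PFNMAP))
		return SKETCH_S2_UNCHANGED;
	if (vma->s1_type != SKETCH_MT_NORMAL)
		return SKETCH_S2_UNCHANGED;
	if (!has_fwb)
		return SKETCH_S2_REJECT;
	return SKETCH_S2_CACHEABLE;
}

/* Walks the three cases discussed in the cover letter. */
void sketch_self_check(void)
{
	struct sketch_vma gpu = { SKETCH_VM_PFNMAP, SKETCH_MT_NORMAL };
	struct sketch_vma mmio = { SKETCH_VM_PFNMAP, SKETCH_MT_OTHER };
	struct sketch_vma ram = { 0, SKETCH_MT_NORMAL };

	assert(sketch_s2_decide(&gpu, true) == SKETCH_S2_CACHEABLE);
	assert(sketch_s2_decide(&gpu, false) == SKETCH_S2_REJECT);
	assert(sketch_s2_decide(&mmio, true) == SKETCH_S2_UNCHANGED);
	assert(sketch_s2_decide(&ram, false) == SKETCH_S2_UNCHANGED);
}
```

Calling sketch_self_check() from a small main() exercises the cacheable
GPU-memory case, the no-FWB rejection, and the two cases the new check
leaves alone.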