This is a follow-up series to fix the security risk for non-coherent
device assignment raised by Jason in [1].

When the IOMMU does not enforce cache coherency, devices are allowed to
perform non-coherent DMAs (DMAs that lack CPU cache snooping). This poses
a risk of information leakage when such a device is assigned to a VM.
Specifically, a malicious guest could retrieve stale host data through
non-coherent DMA reads of physical memory while the data initialized by
the host (e.g., zeros) still resides only in the CPU cache.

Furthermore, the host kernel (e.g., a ksm thread) might encounter
inconsistent data between the CPU cache and physical memory (left by a
malicious guest) after a page is unpinned for DMA but before the page is
recycled.

Therefore, a mitigation in VFIO/IOMMUFD is required to flush CPU caches
on pages involved in non-coherent DMAs before they are mapped into the
IOMMU and after they are unmapped from it.

The mitigation is not implemented in the DMA API layer, so as to avoid
slowing down DMA API users. Users of the DMA API are expected to take
care of CPU cache flushing in one of two ways: (a) by using the DMA API,
which is aware of the non-coherence and does the flushes internally, or
(b) by being aware of their flushing needs and handling them on their own
if they override the platform using no-snoop. A general mitigation in the
DMA API layer will only come when non-coherent DMAs are common, which is
not the case today (currently only Intel GPUs and some ARM devices).

Also, the mitigation is not implemented in the IOMMU core exclusively for
VMs, because the IOMMU core lacks information about the IOVA-to-PFN
relationship, so a large IOTLB flush range would have to be split.

Given that non-coherent devices exist on both x86 and ARM, this series
introduces an arch helper to flush CPU caches for non-coherent DMAs. The
helper is available to both VFIO and IOMMUFD, though currently an
implementation is provided only for x86.

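To make the leak scenario concrete, here is a toy userspace model (not
kernel code, and not taken from this series). It assumes a simple
write-back cache with one entry per byte. All names (toy_machine,
toy_cpu_write(), toy_dma_read(), toy_flush_range()) are hypothetical and
exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MEM_SIZE 64	/* bytes of modelled physical memory */

/* Toy model only: one write-back cache entry per byte of memory. */
struct toy_machine {
	uint8_t mem[TOY_MEM_SIZE];	/* what a non-snooping device sees */
	uint8_t cache[TOY_MEM_SIZE];	/* CPU-side copy of each byte */
	bool dirty[TOY_MEM_SIZE];	/* cache entry not yet written back */
};

/* Fill memory with "stale" data left by a previous owner; cache is clean. */
void toy_init(struct toy_machine *m, uint8_t stale)
{
	memset(m->mem, stale, sizeof(m->mem));
	memset(m->cache, 0, sizeof(m->cache));
	memset(m->dirty, 0, sizeof(m->dirty));
}

/* CPU store: lands in the cache only; memory changes on write-back. */
void toy_cpu_write(struct toy_machine *m, size_t addr, uint8_t val)
{
	assert(addr < TOY_MEM_SIZE);
	m->cache[addr] = val;
	m->dirty[addr] = true;
}

/* Non-coherent DMA read: bypasses the cache and reads memory directly. */
uint8_t toy_dma_read(const struct toy_machine *m, size_t addr)
{
	assert(addr < TOY_MEM_SIZE);
	return m->mem[addr];
}

/* Flush: write dirty entries in [addr, addr + len) back to memory. */
void toy_flush_range(struct toy_machine *m, size_t addr, size_t len)
{
	assert(addr <= TOY_MEM_SIZE && len <= TOY_MEM_SIZE - addr);
	for (size_t i = addr; i < addr + len; i++) {
		if (m->dirty[i]) {
			m->mem[i] = m->cache[i];
			m->dirty[i] = false;
		}
	}
}
```

In this model, if the host zeroes a byte with toy_cpu_write() and the
device then calls toy_dma_read() without an intervening
toy_flush_range(), the device still observes the stale value passed to
toy_init(). After the flush it observes the zero. That flush is the step
the mitigation adds before a page becomes reachable by non-coherent DMA.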
Series Layout:

Patch 1 first fixes an error in pat_pfn_immune_to_uc_mtrr(), which always
returns WB for untracked PAT ranges. This error leads KVM to treat all
PFNs within these untracked PAT ranges as cacheable memory types, even
when a PFN's MTRR type is UC (one example is the VGA range
0xa0000-0xbffff). Patch 3 will use pat_pfn_immune_to_uc_mtrr() to
determine uncacheable PFNs.

Patch 2 is a side fix in KVM to prevent guest cacheable access to PFNs
mapped as UC in the host.

Patch 3 introduces and exports an arch helper, arch_clean_nonsnoop_dma(),
to flush CPU cachelines. It takes a physical address and a size as inputs
and provides an implementation for x86.
Executing CLFLUSH on certain MMIO ranges on x86 can be problematic,
potentially causing machine check exceptions on some platforms, while
flushing is necessary on some other MMIO ranges (e.g., some MMIO ranges
for PMEM). This patch therefore determines cacheability by consulting the
PAT (if enabled) or the MTRR type (if PAT is disabled), and assesses
whether a PFN is considered uncacheable by the host. For reserved pages
or !pfn_valid() PFNs, CLFLUSH is avoided if the PFN is recognized as
uncacheable on the host.

Patch 4/5 implement a mitigation in vfio/iommufd to flush CPU caches
- before a page is accessible to non-coherent DMAs,
- after the page is inaccessible to non-coherent DMAs, and right before
  it's unpinned for DMAs.

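The flush decision described for Patch 3 can be sketched in plain C as
below. This is only one reading of the description above, not the code in
the patch. struct pfn_info and both sketch_*() functions are hypothetical.
In the kernel the three inputs would presumably come from pfn_valid(), the
page's reserved flag, and the PAT/MTRR lookup. The second helper shows the
usual cache-line rounding for a (physical address, size) range, with the
line size passed in as an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-PFN facts, gathered elsewhere (not a kernel struct). */
struct pfn_info {
	bool valid;		/* PFN is backed by a struct page */
	bool reserved;		/* page is marked reserved */
	bool host_uncacheable;	/* PAT/MTRR says the host maps it UC */
};

/*
 * Skip CLFLUSH only when the PFN may be MMIO (invalid or reserved) AND
 * the host treats it as uncacheable; flush in every other case.
 */
bool sketch_should_clflush(const struct pfn_info *p)
{
	bool maybe_mmio = !p->valid || p->reserved;

	return !(maybe_mmio && p->host_uncacheable);
}

/*
 * Number of cache lines touched by [phys, phys + size).
 * 'line' is the cache-line size in bytes and must be a power of two.
 */
size_t sketch_cachelines_in_range(uint64_t phys, size_t size, size_t line)
{
	uint64_t mask, first, last;

	assert(line != 0 && (line & (line - 1)) == 0);
	if (size == 0)
		return 0;

	mask = ~(uint64_t)(line - 1);
	first = phys & mask;			/* line holding first byte */
	last = (phys + size - 1) & mask;	/* line holding last byte */
	return (size_t)((last - first) / line) + 1;
}
```

With a 64-byte line, a 4096-byte page at a page-aligned address covers 64
lines, while a 2-byte range starting at offset 0x3f straddles two lines.
Normal RAM pages (valid, not reserved) are always flushed under this
reading, regardless of the uncacheable flag.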
Performance data:

The overhead of flushing CPU caches is measured below:
CPU MHz: 4494.377, 4 vCPU, 8G guest memory
Pass-through GPU: 1G aperture

Across each VM boot up and tear down,

IOMMUFD     |     Map        |    Unmap       |  Teardown
------------|----------------|----------------|---------------
w/o clflush | 1167M          |   40M          |  201M
w/  clflush | 2400M (+1233M) |  276M (+236M)  | 1160M (+959M)

Map = total cycles of iommufd_ioas_map() during VM boot up
Unmap = total cycles of iommufd_ioas_unmap() during VM boot up
Teardown = total cycles of iommufd_hwpt_paging_destroy() at VM teardown

VFIO        |     Map        |    Unmap       |  Teardown
------------|----------------|----------------|---------------
w/o clflush | 3058M          |  379M          |  448M
w/  clflush | 5664M (+2606M) | 1653M (+1274M) | 1522M (+1074M)

Map = total cycles of vfio_dma_do_map() during VM boot up
Unmap = total cycles of vfio_dma_do_unmap() during VM boot up
Teardown = total cycles of vfio_iommu_type1_detach_group() at VM teardown

[1] https://lore.kernel.org/lkml/20240109002220.GA439767@xxxxxxxxxx

Yan Zhao (5):
  x86/pat: Let pat_pfn_immune_to_uc_mtrr() check MTRR for untracked PAT
    range
  KVM: x86/mmu: Fine-grained check of whether a invalid & RAM PFN is
    MMIO
  x86/mm: Introduce and export interface arch_clean_nonsnoop_dma()
  vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  iommufd: Flush CPU caches on DMA pages in non-coherent domains

 arch/x86/include/asm/cacheflush.h       |  3 +
 arch/x86/kvm/mmu/spte.c                 | 14 +++-
 arch/x86/mm/pat/memtype.c               | 12 +++-
 arch/x86/mm/pat/set_memory.c            | 88 +++++++++++++++++++++++++
 drivers/iommu/iommufd/hw_pagetable.c    | 19 +++++-
 drivers/iommu/iommufd/io_pagetable.h    |  5 ++
 drivers/iommu/iommufd/iommufd_private.h |  1 +
 drivers/iommu/iommufd/pages.c           | 44 ++++++++++++-
 drivers/vfio/vfio_iommu_type1.c         | 51 ++++++++++++++
 include/linux/cacheflush.h              |  6 ++
 10 files changed, 237 insertions(+), 6 deletions(-)

base-commit: e67572cd2204894179d89bd7b984072f19313b03
-- 
2.17.1