This RFC series proposes a framework to resolve IOPF by sharing KVM TDP (Two Dimensional Paging) page table to IOMMU as its stage 2 paging structure to support IOPF (IO page fault) on IOMMU's stage 2 paging structure. Previously, all guest pages have to be pinned and mapped in IOMMU stage 2 paging structures after pass-through devices attached, even if the device has IOPF capability. Such all-guest-memory pinning can be avoided when IOPF handling for stage 2 paging structure is supported and if there are only IOPF-capable devices attached to a VM. There are 2 approaches to support IOPF on IOMMU stage 2 paging structures: - Supporting by IOMMUFD/IOMMU alone IOMMUFD handles IO page faults on stage-2 HWPT by calling GUPs and then iommu_map() to setup IOVA mappings. (IOAS is required to keep info of GPA to HVA, but page pinning/unpinning needs to be skipped.) Then upon MMU notifiers on host primary MMU, iommu_unmap() is called to adjust IOVA mappings accordingly. IOMMU driver needs to support unmapping sub-ranges of a previous mapped range and take care of huge page merge and split in atomic way. [1][2]. - Sharing KVM TDP IOMMUFD sets the root of KVM TDP page table (EPT/NPT in x86) as the root of IOMMU stage 2 paging structure, and routes IO page faults to KVM. (This assumes that the iommu hw supports the same stage-2 page table format as CPU.) In this model the page table is centrally managed by KVM (mmu notifier, page mapping, subpage unmapping, atomic huge page split/merge, etc.), while IOMMUFD only needs to invalidate iotlb/devtlb properly. Currently, there's no upstream code available to support stage 2 IOPF yet. This RFC chooses to implement "Sharing KVM TDP" approach which has below main benefits: - Unified page table management The complexity of allocating guest pages per GPAs, registering to MMU notifier on host primary MMU, sub-page unmapping, atomic page merge/split are only required to by handled in KVM side, which has been doing that well for a long time. - Reduced page faults: Only one page fault is triggered on a single GPA, either caused by IO access or by vCPU access. (compared to one IO page fault for DMA and one CPU page fault for vCPUs in the non-shared approach.) - Reduced memory consumption: Memory of one page table are saved. Design == In this series, term "exported" is used in place of "shared" to avoid confusion with terminology "shared EPT" in TDX. The framework contains 3 main objects: "KVM TDP FD" object - The interface of KVM to export TDP page tables. With this object, KVM allows external components to access a TDP page table exported by KVM. "IOMMUFD KVM HWPT" object - A proxy connecting KVM TDP FD to IOMMU driver. This HWPT has no IOAS associated. "KVM domain" in IOMMU driver - Stage 2 domain in IOMMU driver whose paging structures are managed by KVM. Its hardware TLB invalidation requests are notified from KVM via IOMMUFD KVM HWPT object. 2.IOMMU_HWPT_ALLOC(fd) 1. KVM_CREATE_TDP_FD .------. +--------------| QEMU |----------------------+ | '------'<---+ fd | | | v | | .-------. v | create | KVM | .------------------. .------------.<------'-------' | IOMMUFD KVM HWPT | | KVM TDP FD | | '------------------' '------------' | | kvm_tdp_fd_get(fd) | | |------------------------->| | IOMMU | | | driver alloc(meta) |---------get meta-------->| | .------------.<---------| | | | KVM Domain | |----register_importer---->| | '------------' | | | | | | | | 3. | | | |----iopf handler---->|----------fault---------->|------map------->| | | | 4. | |<-------invalidate---|<-------invalidate--------|<---TLB flush----| | | | | |<-----free-----------| 5. | | |----unregister_importer-->| | | | | |------------------------->| | kvm_tdp_fd_put() 1. QEMU calls KVM_CREATE_TDP_FD to create a TDP FD object. Address space must be specified to identify the exported TDP page table (e.g. system memory or SMM mode system memory in x86). 2. QEMU calls IOMMU_HWPT_ALLOC to create a KVM-type HWPT. The KVM-type HWPT is created upon an exported KVM TDP FD (rather than upon an IOAS), acting as the proxy between KVM TDP and IOMMU driver: - Obtain reference on the exported KVM TDP FD. - get and pass meta data of KVM TDP page tables to IOMMU driver for KVM domain allocation. - register importer callbacks to KVM for invalidation notification. - register a IOPF handler into IOMMU's KVM domain. Upon device attachment, the root HPA of the exported TDP page table is installed to IOMMU hardware. 3. When IO page faults come, IOMMUFD fault handler forwards the fault to KVM. 4. When KVM performs TLB flush, it notifies all importers of KVM TDP FD object. IOMMUFD KVM HWPT, as an importer, will pass the notification to IOMMU driver for hardware TLB invalidations. 5. On destroy IOMMUFD KVM HWPT, it frees IOMMU's KVM domain, unregisters itself as an importer from KVM TDP FD object and puts reference count of KVM TDP FD object. Status == Current support of IOPF on IOMMU stage 2 paging structure is verified on Intel DSA devices on Intel SPR platform. There's no vIOMMU for guest and Intel DSA devices run in-kernel DMA tests successfully with IOPFs handled in host. - Nested translation in IOMMU is currently not supported. - QEMU code in IOMMUFD to create KVM HWPT is just a temporary hack. As KVM HWPT has no IOAS associated, need to fit in current QEMU code to create KVM HWPT with no IOAS and to ensure the address space is from GPA to HPA. - DSA IOPF hack in guest driver. Although DSA hw tolerates IOPF in all DMA paths, DSA driver has the flexibility to turn off IOPF in certain paths. This RFC currently hacks the guest driver to always turn on IOPF. Note == - KVM page write-tracking Unlike write-protection which usually adds back the write permission upon a write fault and re-executes the faulting instruction, KVM page write-tracking keeps the write permission disabled for the tracked pages and instead always emulates the faulting instruction upon fault. There is no way to emulate a faulting DMA request so IOPF and KVM page write-tracking are incompatible. In this RFC we didn't handle the conflict given write-tracking is applied to guest page table pages so far, which are unlikely to be used as DMA buffer. - IOMMU page-walk coherency It's about whether IOMMU hardware will snoop the processor cache of the I/O paging structures. If IOMMU page-walk is non-coherent, the software needs to do clflush after changing the I/O paging structures. Supporting non-coherent IOMMU page-walk adds extra burden (i.e. clflush) in KVM mmu in this shared model, which we don't plan to support. Fortunately most Intel platforms do support coherent page-walk in IOMMU so this exception should not be a big matter. - Non-coherent DMA Non-coherent DMA requires KVM mmu to align the effective memory type with the guest memory type (CR0.CD, vPAT, vMTRR) instead of forcing all guest memory to be WB. It further involves complexities in fault handler to check guest memory type too which requires a vCPU context. There is certainly no vCPU context in an I/O page fault. So this RFC doesn't support devices which cannot be enforced to do coherent DMA. If there is interest in supporting non-coherent DMA in this shared model, there's a discussion about removing vMTRR stuffs in KVM page fault handler [3] hence it's also possible to further remove the vCPU context there. - Enforce DMA cache coherency This design requires the IOMMU supporting a configuration forcing all DMAs to be coherent (even if the PCI request out of the device sets the non-snoop bit) due to aforementioned reason. The control of enforcing cache coherency could be per-IOPT or per-page. e.g. Intel VT-d defines a per-page format (bit 11 in PTE represents the enforce-snoop bit) in legacy mode and a per-IOPT format (control bit in the pasid entry) in scalable mode. Supporting per-page format requires KVM mmu to disable any software use of bit 11 and also provide additional ops for on-demand set/clear-snp requests from iommufd. It's complex and dirty. Therefore the per-IOPT scheme is assumed in this design. For Intel IOMMU, the scalable mode is the default mode for all new IOMMU features (nested translation, pasid, etc.) anyway. - About device which partially supports IOPF Many devices claiming PCIe PRS capability actually only tolerate IOPF in certain paths (e.g. DMA paths for SVM applications, but not for non-SVM applications or driver data such as ring descriptors). But the PRS capability doesn't include a bit to tell whether a device 100% tolerates IOPF in all DMA paths. This creates a trouble how the userspace driver framework (e.g. VFIO) knows that a device with PRS can really avoid static-pinning of the entire guest memory and then reports such knowledge to the VMM. A simple way is to track an allowed list of devices which are known 100% IOPF-friendly in VFIO. Another option is to extend PCIe spec to allow device reporting whether it fully or partially supports IOPF in the PRS capability. Another interesting option is to explore supporting partial-IOPF in this sharing model: * Create a VFIO variant driver to intercept guest operations which registers non-faultable memory to the device and to call KVM TDP ops to request on-demand pinning of traped memory pages in KVM mmu. This allows the VMM to start with zero-pinning as for 100%-faultable device with on demand pinning initiated by the variant driver. * Supporting on-demand pinning in KVM mmu however requires non-trivial effort. Besides introducing logic to pin pages in long term and manage the list of pinned GFNs, more caveats are required to avoid breaking the implication of page pinning, e.g.: a. PTE updates in a pinned GFN range must be atomic, otherwise an in-fly DMA might be broken b. PTE zap in a pinned GFN range is allowed only when the related memory slot is removed (indicating guest won't use it for DMA). The PTE zap for the affected range must be either disabled or replaced by an atomic update. c. any feature related to write-protecting the pinned GFN range is not allowed. This implies live migration is also broken in current way as it starts with write-protection even when TDP dirty bit tracking is enabled. To support on-demand pinning it then requires to rely on a less efficient way by always walking TDP dirty bit instead of using write-protection. Or, we may enhance the live migration code to treat pinned ranges as dirty always. d. Auto NUMA balance also needs to be disabled. [4] If above trickiness can be resolved cleanly, this sharing model could also support a non-faultable device in theory by pinning/unpinning guest memory on slot addition/removal. - How to map MSI page on arm platform demands discussions. Patches layout == [01-08]: Skeleton implementation of KVM's TDP FD object. Patch 1 and 2 are for public and arch specific headers. Patch 4's commit message outlines overall data structure hierarchy on x86 for preview. [09-23]: IOMMU, IOMMUFD and Intel vt-d. - 09-11: IOMMU core part - 12-16: IOMMUFD part Patch 13 is the main patch in IOMMUFD to implement KVM HWPT. - 17-23: Intel vt-d part for KVM domain Patch 18 is the main patch to implement KVM domain. [24-42]: KVM x86 and VMX part - 24-34: KVM x86 preparation patches. Patch 24: Let KVM to reserve bit 11 since bit 11 is reserved as 0 in IOMMU side. Patch 25: Abstract "struct kvm_mmu_common" from "struct kvm_mmu" for "kvm_exported_tdp_mmu" Patches 26~34: Prepare for page fault in non-vCPU context. - 35-38: Core part in KVM x86 Patch 35: X86 MMU core part to show how exported TDP root page is shared between KVM external components and vCPUs. Patch 37: TDP FD fault op implementation - 39-42: KVM VMX part for meta data composing and tlb flush notification. Code base == The code base is commit b85ea95d08647 ("Linux 6.7-rc1") + Yi Liu's v7 series "Add Intel VT-d nested translation (part 2/2)" [5] + Baolu's v7 series "iommu: Prepare to deliver page faults to user space" [6] Complete code can be found at [7], Qemu could be found at [8], Guest test script and workaround patch is at [9]. [1] https://lore.kernel.org/all/20230814121016.32613-1-jijie.ji@xxxxxxxxxxxxxxxxx/ [2] https://lore.kernel.org/all/BN9PR11MB5276D897431C7E1399EFFF338C14A@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/ [3] https://lore.kernel.org/all/ZUAC0jvFE0auohL4@xxxxxxxxxx/ [4] https://lore.kernel.org/all/4cb536f6-2609-4e3e-b996-4a613c9844ad@xxxxxxxxxx/ [5] https://lore.kernel.org/linux-iommu/20231117131816.24359-1-yi.l.liu@xxxxxxxxx/ [6] https://lore.kernel.org/linux-iommu/20231115030226.16700-1-baolu.lu@xxxxxxxxxxxxxxx/ [7] https://github.com/yanhwizhao/linux_kernel/tree/sharept_iopt [8] https://github.com/yanhwizhao/qemu/tree/sharept_iopf [9] https://github.com/yanhwizhao/misc/tree/master Yan Zhao (42): KVM: Public header for KVM to export TDP KVM: x86: Arch header for kvm to export TDP for Intel KVM: Introduce VM ioctl KVM_CREATE_TDP_FD KVM: Skeleton of KVM TDP FD object KVM: Embed "arch" object and call arch init/destroy in TDP FD KVM: Register/Unregister importers to KVM exported TDP KVM: Forward page fault requests to arch specific code for exported TDP KVM: Add a helper to notify importers that KVM exported TDP is flushed iommu: Add IOMMU_DOMAIN_KVM iommu: Add new iommu op to create domains managed by KVM iommu: Add new domain op cache_invalidate_kvm iommufd: Introduce allocation data info and flag for KVM managed HWPT iommufd: Add a KVM HW pagetable object iommufd: Enable KVM HW page table object to be proxy between KVM and IOMMU iommufd: Add iopf handler to KVM hw pagetable iommufd: Enable device feature IOPF during device attachment to KVM HWPT iommu/vt-d: Make some macros and helpers to be extern iommu/vt-d: Support of IOMMU_DOMAIN_KVM domain in Intel IOMMU iommu/vt-d: Set bit PGSNP in PASIDTE if domain cache coherency is enforced iommu/vt-d: Support attach devices to IOMMU_DOMAIN_KVM domain iommu/vt-d: Check reserved bits for IOMMU_DOMAIN_KVM domain iommu/vt-d: Support cache invalidate of IOMMU_DOMAIN_KVM domain iommu/vt-d: Allow pasid 0 in IOPF KVM: x86/mmu: Move bit SPTE_MMU_PRESENT from bit 11 to bit 59 KVM: x86/mmu: Abstract "struct kvm_mmu_common" from "struct kvm_mmu" KVM: x86/mmu: introduce new op get_default_mt_mask to kvm_x86_ops KVM: x86/mmu: change param "vcpu" to "kvm" in kvm_mmu_hugepage_adjust() KVM: x86/mmu: change "vcpu" to "kvm" in page_fault_handle_page_track() KVM: x86/mmu: remove param "vcpu" from kvm_mmu_get_tdp_level() KVM: x86/mmu: remove param "vcpu" from kvm_calc_tdp_mmu_root_page_role() KVM: x86/mmu: add extra param "kvm" to kvm_faultin_pfn() KVM: x86/mmu: add extra param "kvm" to make_mmio_spte() KVM: x86/mmu: add extra param "kvm" to make_spte() KVM: x86/mmu: add extra param "kvm" to tdp_mmu_map_handle_target_level() KVM: x86/mmu: Get/Put TDP root page to be exported KVM: x86/mmu: Keep exported TDP root valid KVM: x86: Implement KVM exported TDP fault handler on x86 KVM: x86: "compose" and "get" interface for meta data of exported TDP KVM: VMX: add config KVM_INTEL_EXPORTED_EPT KVM: VMX: Compose VMX specific meta data for KVM exported TDP KVM: VMX: Implement ops .flush_remote_tlbs* in VMX when EPT is on KVM: VMX: Notify importers of exported TDP to flush TLBs on KVM flushes EPT arch/x86/include/asm/kvm-x86-ops.h | 4 + arch/x86/include/asm/kvm_exported_tdp.h | 43 +++ arch/x86/include/asm/kvm_host.h | 48 ++- arch/x86/kvm/Kconfig | 13 + arch/x86/kvm/mmu.h | 12 +- arch/x86/kvm/mmu/mmu.c | 434 +++++++++++++++++------ arch/x86/kvm/mmu/mmu_internal.h | 8 +- arch/x86/kvm/mmu/paging_tmpl.h | 15 +- arch/x86/kvm/mmu/spte.c | 31 +- arch/x86/kvm/mmu/spte.h | 82 ++++- arch/x86/kvm/mmu/tdp_mmu.c | 209 +++++++++-- arch/x86/kvm/mmu/tdp_mmu.h | 9 + arch/x86/kvm/svm/svm.c | 2 +- arch/x86/kvm/vmx/nested.c | 2 +- arch/x86/kvm/vmx/vmx.c | 56 ++- arch/x86/kvm/x86.c | 68 +++- drivers/iommu/intel/Kconfig | 9 + drivers/iommu/intel/Makefile | 1 + drivers/iommu/intel/iommu.c | 68 ++-- drivers/iommu/intel/iommu.h | 47 +++ drivers/iommu/intel/kvm.c | 185 ++++++++++ drivers/iommu/intel/pasid.c | 3 +- drivers/iommu/intel/svm.c | 37 +- drivers/iommu/iommufd/Kconfig | 10 + drivers/iommu/iommufd/Makefile | 1 + drivers/iommu/iommufd/device.c | 31 +- drivers/iommu/iommufd/hw_pagetable.c | 29 +- drivers/iommu/iommufd/hw_pagetable_kvm.c | 270 ++++++++++++++ drivers/iommu/iommufd/iommufd_private.h | 44 +++ drivers/iommu/iommufd/main.c | 4 + include/linux/iommu.h | 18 + include/linux/kvm_host.h | 58 +++ include/linux/kvm_tdp_fd.h | 137 +++++++ include/linux/kvm_types.h | 12 + include/uapi/linux/iommufd.h | 15 + include/uapi/linux/kvm.h | 19 + virt/kvm/Kconfig | 6 + virt/kvm/Makefile.kvm | 1 + virt/kvm/kvm_main.c | 24 ++ virt/kvm/tdp_fd.c | 344 ++++++++++++++++++ virt/kvm/tdp_fd.h | 15 + 41 files changed, 2177 insertions(+), 247 deletions(-) create mode 100644 arch/x86/include/asm/kvm_exported_tdp.h create mode 100644 drivers/iommu/intel/kvm.c create mode 100644 drivers/iommu/iommufd/hw_pagetable_kvm.c create mode 100644 include/linux/kvm_tdp_fd.h create mode 100644 virt/kvm/tdp_fd.c create mode 100644 virt/kvm/tdp_fd.h -- 2.17.1