Hi, This is v2 of the TDX MMU prep series, split out of the giant 130 patch TDX base enabling series [0]. It is focusing on the changes to the x86 MMU to support TDX’s separation of private/shared EPT into separate roots. A future breakout series will include the changes to actually interact with the TDX module to actually map private memory. The purpose of sending out a smaller series is to focus review, and hopefully rapidly iterate. We would like the series to go into kvm-coco-queue when it is ready. I think the maturity of these patches has significantly improved during the recent reviews. I expecting it still needs a little more work, but think that the basic structure is in decent shape at this point. Please consider it from the perspective of what is missing for inclusion in kvm-coco-queue. There is a larger team working on TDX KVM base enabling. The patches were originally authored by Sean Christopherson and Isaku Yamahata, but otherwise it especially represents the work of Isaku and Yan Y Zhao and myself. The series has been tested as part of a development branch for the TDX base series [1]. The testing of this series consists TDX kvm-unit-tests [2], regular KVM selftests, and booting a Linux TD. Rebasing the TDX selftests is still in progress. Updates from v2 =============== Most of the changes for v2 are around code clarity. At a high level: - Previously the (described below) mirror changes had a bunch of "private" naming in the generic MMU code. These have been renamed to "mirror". The connection between TDX private memory and mirrored memory inside mmu.h helpers. Now the generic MMU code simply selects the mirror root when performing private operations (faults, invalidation, etc). The TDX seamcalls to update the S-EPT are hidden inside the x86_ops, so the special TDX private parts don't pollute the concept for SEV. To do this the roots enum is changed to be about "mirror" and "direct" instead of "private" and "shared". The process enum retains these concepts, and TDX specific logic matches between the two. So there is a little more purpose to have two enums. - The gfn_shared_mask is now gfn_direct_mask, and it simply applies the mask when operating on the direct moot (i.e. a fault targeting a “gfn” on a normal “direct” root gets faulted in at gfn|gfn_direct_mask). Like the private memory connection, the connection between the normal "direct" root and shared memory is hidden inside mmu.h helpers. - The shared bit has been removed from all of the gfn_t's. This is achieved by applying the gfn_direct_mask from within the TDP MMU iterator. It prevents more of the MMU code to have to take special care around GFN's. - There also has been some re-organization and patch splitting to aid in reviewability. - Log improvements The quirk for zapping all PTEs on memslot deletion will come as a separate series. The change to make kvm_mmu_max_gfn() configurable will also come separately. The kvm_zap_gfn_range() KVM_BUG_ON() patch is dropped. Private/shared memory in TDX ============================ Confidential computing solutions have concepts of private and shared memory. Often the guest accesses either private or shared memory via a bit in the PTE. Solutions like SEV treat this bit more like a permission bit, where solutions like TDX and ARM CCA treat it more like a GPA bit. In the latter case, the host maps private memory in one half of the address space and shared in another. For TDX these two halves are mapped by different EPT roots. The private half (also called Secure EPT in Intel documentation) gets managed by the privileged TDX Module. The shared half is managed by the untrusted part of the VMM (KVM). In addition to the separate roots for private and shared, there are limitations on what operations can be done on the private side. TDX wants to protect against protected memory being reset or otherwise scrambled by the host. In order to prevent this, the guest has to take specific action to “accept” memory after changes are made by the VMM to the private EPT. This prevents the VMM from performing many of the usual memory management operations that involve zapping and refaulting memory. The private memory also is always RWX and cannot have VMM specified cache attribute attributes applied. TDX KVM MMU Design For Private Memory ===================================== Private/shared split -------------------- The operations that actually change the private half of the EPT are limited and relatively slow compared to reading a PTE. For this reason the design for KVM is to keep a “mirrored” copy of the private EPT in KVM’s memory. This will allow KVM to quickly walk the EPT and only perform the slower private EPT operations when it needs to actually modify mid-level private PTEs. To clarify the definitions of the three EPT trees at this point: private EPT - Protected by the TDX module, modified via TDX module calls. mirror EPT - Bookkeeping tree used as an optimization by KVM, not mapped. shared EPT - Normal EPT that maps unencrypted shared memory. Managed like the EPT of a normal VM. It’s worth noting that we are making an effort to remove optimizations that have complexity for the base enabling. Although keeping a mirrored copy of the private page tables kind of fits into that category, it has been so fundamental to the design for so long, dropping it would be too disruptive. Mirror EPT ------------ The mirror EPT needs to keep a mirrored version of the private EPT maintained in the TDX module in order to be able to find out if a GPA’s mid-level pagetable have already been installed. So this mirrored copy has the same structure as the private EPT, having a page table present for every GPA range and level in the mirrored EPT where a page table is present private. The private page tables also cannot be zapped while the range has anything mapped, so the mirrored/private page tables need to be protected from KVM operations that zap any non-leaf PTEs, for example kvm_mmu_reset_context() or kvm_mmu_zap_all_fast() Modifications to the mirrored page tables need to also perform the same operations to the private page tables. The actual TDX module calls to do this are not covered in this prep series. For convenience SPs for private page tables are tracked with a role bit out of convenience. (Note to reviewers, please consider if this is really needed). Zapping Changes --------------- For normal VMs, guest memory is zapped for several reasons, like user memory getting paged out by the guest, memslots getting deleted or virtualization operations like MTRRs, and attachment of non-coherent DMA. For TDX (and SNP) there is also zapping associated with the conversion of memory between shared and privates. These operations need to take care to do two things: 1. Not zap any private memory that is in use by the guest. 2. Not zap any memory alias unnecessarily (i.e. Don’t zap anything more than needed). The purpose of this is to not have any unnecessary behavior userspace could grow to rely on. For 1, this is possible because the zapping that is out of the control of KVM/userspace (paging out of userspace memory) will only apply to shared memory. Guest mem fd operations are protected from mmu notifier operations. During TD runtime, zapping of private memory will only be from memslot deletion and from conversion between private and shared memory which is triggered by the guest. For 2, KVM needs to be taught which operations will operate on which aliases. An enum based scheme is introduced such that operations can target specific aliases like: Memslot deletion - Private and shared MMU notifier based zapping - Shared only Conversion to shared - Private only Conversion to private - Shared only MTRRs, etc - Zapping will be avoided all together For zapping arising from other virtualization based operations, there are four scenarios: 1. MTRR update 2. CR0.CD update 3. APICv update 4. Non-coherent DMA status update KVM TDX will not support 1-3. In future changes (after this series) the features will not be supported for TDX. For 4, there isn’t an easy way to not support the feature as the notification is just passed to KVM and it has to act accordingly. However, other proposed changes [3] will avoid the need for zapping on non-coherent DMA notification for selfsnoop CPUs. So KVM can follow this logic and just always honor guest PAT for shared memory. Atomically updating private EPT ------------------------------- Although this prep series does not interact with the TDX module at all to actually configure the private EPT, it does lay the ground work for doing this. In some ways updating the private EPT is as simple as plumbing PTE modifications through to also call into the TDX module, but there is one tricky property that is worth elaborating on. That is how to handle that the TDP MMU allows modification of PTEs with the mmu_lock held only for read and uses the PTEs themselves to perform synchronization. Unfortunately while operating on a single PTE can be done atomically, operating on both the mirrored and private PTEs at the same time needs additional solution. To handle this situation, REMOVED_SPTE is used to prevent concurrent operations while a call to the TDX module updates the private EPT. The series is based on kvm-coco-queue. [0] https://lore.kernel.org/kvm/cover.1708933498.git.isaku.yamahata@xxxxxxxxx/ [1] https://github.com/intel/tdx/tree/tdx_kvm_dev-2024-05-30 [2] https://lore.kernel.org/kvm/20231218072247.2573516-1-qian.wen@xxxxxxxxx/ [3] https://lore.kernel.org/kvm/20240309010929.1403984-6-seanjc@xxxxxxxxxx/ Isaku Yamahata (12): KVM: Add member to struct kvm_gfn_range for target alias KVM: x86/mmu: Add a mirrored pointer to struct kvm_mmu_page KVM: x86/mmu: Add a new mirror_pt member for union kvm_mmu_page_role KVM: x86/mmu: Support GFN direct mask KVM: x86/tdp_mmu: Extract root invalid check from tdx_mmu_next_root() KVM: x86/tdp_mmu: Introduce KVM MMU root types to specify page table type KVM: x86/tdp_mmu: Support mirror root for TDP MMU KVM: x86/tdp_mmu: Reflect building mirror page tables KVM: x86/tdp_mmu: Reflect tearing down mirror page tables KVM: x86/tdp_mmu: Take root types for kvm_tdp_mmu_invalidate_all_roots() KVM: x86/tdp_mmu: Make mmu notifier callbacks to check kvm_process KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU Rick Edgecombe (2): KVM: x86: Add a VM type define for TDX KVM: x86/mmu: Make kvm_tdp_mmu_alloc_root() return void Sean Christopherson (1): KVM: x86/tdp_mmu: Invalidate correct roots arch/x86/include/asm/kvm-x86-ops.h | 4 + arch/x86/include/asm/kvm_host.h | 45 +++- arch/x86/include/uapi/asm/kvm.h | 1 + arch/x86/kvm/mmu.h | 28 +++ arch/x86/kvm/mmu/mmu.c | 37 ++- arch/x86/kvm/mmu/mmu_internal.h | 68 +++++- arch/x86/kvm/mmu/spte.h | 5 + arch/x86/kvm/mmu/tdp_iter.c | 5 +- arch/x86/kvm/mmu/tdp_iter.h | 16 +- arch/x86/kvm/mmu/tdp_mmu.c | 352 +++++++++++++++++++++++------ arch/x86/kvm/mmu/tdp_mmu.h | 45 +++- arch/x86/kvm/x86.c | 10 + include/linux/kvm_host.h | 8 + virt/kvm/guest_memfd.c | 2 + virt/kvm/kvm_main.c | 14 ++ 15 files changed, 537 insertions(+), 103 deletions(-) base-commit: 698ca1e403579ca00e16a5b28ae4d576d9f1b20e -- 2.34.1