Hi,

This is a TDX prep series, split out of the giant 130 patch TDX base
enabling series [0]. It focuses on the changes to the KVM MMU to support
TDX’s separation of private/shared EPT. A future breakout series will
include the changes to interact with the TDX module to actually map
private memory. The purpose of sending out a smaller series is to focus
review, and hopefully rapidly iterate on the smaller series. It is not
quite ready for upstream inclusion yet, but it has reached another point
where more public comments could help.

There is a larger team working on TDX KVM base enabling. Most patches
were originally authored by Sean Christopherson and Isaku Yamahata, with
recent work by Yan Y Zhao, Isaku and myself.

The series has been tested as part of a development branch for the TDX
base series [1]. The testing so far consists of TDX kvm-unit-tests [2],
booting a Linux TD, and regular KVM selftests (not the TDX ones).

Contents of the series
======================
There are some simple preparatory patches mixed into the series, which
is ordered with bisectability in mind. The patches that most likely need
further discussion are:

KVM: x86/mmu: Introduce a slot flag to zap only slot leafs on slot deletion
  Looking at expanding the need for TDX to zap only the specific PTEs
  for a memslot on deletion, into a general KVM feature.

KVM: Add member to struct kvm_gfn_range for target alias
  Discussion on how to target zapping to the appropriate private/shared
  alias.

KVM: x86/mmu: Bug the VM if kvm_zap_gfn_range() is called for TDX
  A change that includes a discussion on how to handle cache attributes
  on shared memory.

KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
  A big "how to do private/shared split" patch.

Patches 11-15:
  Handling the separate aliases on MMU operations, first started in
  "Add member to struct kvm_gfn_range for target alias"

Private/shared memory in TDX
============================
Confidential computing solutions have concepts of private and shared
memory. Often the guest accesses either private or shared memory via a
bit in the PTE. Solutions like SEV treat this bit more like a permission
bit, whereas solutions like TDX and ARM CCA treat it more like a GPA
bit. In the latter case, the host maps private memory in one half of the
address space and shared in another. For TDX these two halves are mapped
by different EPT roots. The private half (also called Secure EPT in
Intel documentation) gets managed by the privileged TDX Module. The
shared half is managed by the untrusted part of the VMM (KVM).

In addition to the separate roots for private and shared, there are
limitations on what operations can be done on the private side. TDX
wants to protect against protected memory being reset or otherwise
scrambled by the host. In order to prevent this, the guest has to take
specific action to “accept” memory after changes are made by the VMM to
the private EPT. This prevents the VMM from performing many of the usual
memory management operations that involve zapping and refaulting memory.
The private memory also is always RWX and cannot have VMM specified
cache attributes applied.

TDX KVM MMU Design For Private Memory
=====================================

Private/shared split
--------------------
The operations that actually change the private half of the EPT are
limited and relatively slow compared to reading a PTE. For this reason
the design for KVM is to keep a “mirrored” copy of the private EPT in
KVM’s memory. This will allow KVM to quickly walk the EPT and only
perform the slower private EPT operations when it needs to actually
modify mid-level private PTEs.

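As a rough illustration of the GPA-bit model and the root split
described above, the sketch below classifies a GPA by a shared bit and
picks which tree KVM would walk for it. It is not code from the series:
the bit position (47), the type and function names, and the enum are all
assumptions made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gpa_t;

/*
 * Assumption for illustration only: the real bit position depends on the
 * guest's configured GPA width and is not stated in this cover letter.
 */
#define SKETCH_SHARED_BIT 47

/* Hypothetical names for the tree KVM would walk for a given GPA. */
enum sketch_root {
	SKETCH_ROOT_SHARED,	/* normal EPT, managed like a regular VM */
	SKETCH_ROOT_MIRROR,	/* KVM's in-memory copy of the private EPT */
};

static inline gpa_t sketch_shared_mask(void)
{
	return (gpa_t)1 << SKETCH_SHARED_BIT;
}

/* True if the GPA falls in the shared half of the address space. */
static inline bool sketch_gpa_is_shared(gpa_t gpa)
{
	return (gpa & sketch_shared_mask()) != 0;
}

/* Convert between the two aliases of the same guest page. */
static inline gpa_t sketch_gpa_to_private(gpa_t gpa)
{
	return gpa & ~sketch_shared_mask();
}

static inline gpa_t sketch_gpa_to_shared(gpa_t gpa)
{
	return gpa | sketch_shared_mask();
}

/*
 * Private GPAs are looked up in the mirror, so that the slow private EPT
 * operations are only needed when something actually has to change.
 */
static inline enum sketch_root sketch_root_for_gpa(gpa_t gpa)
{
	return sketch_gpa_is_shared(gpa) ? SKETCH_ROOT_SHARED
					 : SKETCH_ROOT_MIRROR;
}

/* Small usage check: both aliases refer to the same page offset. */
static inline void sketch_alias_demo(void)
{
	gpa_t priv = 0x1000;
	gpa_t shared = sketch_gpa_to_shared(priv);

	assert(sketch_gpa_to_private(shared) == priv);
	assert(sketch_root_for_gpa(shared) == SKETCH_ROOT_SHARED);
	assert(sketch_root_for_gpa(priv) == SKETCH_ROOT_MIRROR);
}
```

The point of the sketch is only that the two aliases differ in a single
address bit, and that the bit selects the root rather than a permission.
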
To clarify the definitions of the three EPT trees at this point:

private EPT  - Protected by the TDX module, modified via TDX module
               calls.
mirrored EPT - Bookkeeping tree used as an optimization by KVM, not
               mapped.
shared EPT   - Normal EPT that maps unencrypted shared memory. Managed
               like the EPT of a normal VM.

It’s worth noting that we are making an effort to remove optimizations
that have complexity for the base enabling. Although keeping a mirrored
copy of the private page tables kind of fits into that category, it has
been so fundamental to the design for so long that dropping it would be
too disruptive.

Mirrored EPT
------------
The mirrored EPT needs to track the private EPT maintained in the TDX
module in order to be able to find out if a GPA’s mid-level page tables
have already been installed. So this mirrored copy has the same
structure as the private EPT, having a page table present for every GPA
range and level in the mirrored EPT where a page table is present in the
private EPT. The private page tables also cannot be zapped while the
range has anything mapped, so the mirrored/private page tables need to
be protected from KVM operations that zap any non-leaf PTEs, for example
kvm_mmu_reset_context() or kvm_mmu_zap_all_fast().

Modifications to the mirrored page tables need to also perform the same
operations to the private page tables. The actual TDX module calls to do
this are not covered in this prep series.

For convenience, SPs for private page tables are tracked with a role
bit. (Note to reviewers, please consider if this is really needed).

Zapping Changes
---------------
For normal VMs, guest memory is zapped for several reasons, like user
memory getting paged out by the guest, memslots getting deleted or
virtualization operations like MTRRs, and attachment of non-coherent
DMA. For TDX (and SNP) there is also zapping associated with the
conversion of memory between shared and private.
These operations need to take care to do two things:
1. Not zap any private memory that is in use by the guest.
2. Not zap any memory alias unnecessarily (i.e. don’t zap anything more
   than needed). The purpose of this is to not have any unnecessary
   behavior userspace could grow to rely on.

For 1, this is possible because the zapping that is out of the control
of KVM/userspace (paging out of userspace memory) will only apply to
shared memory. Guest mem fd operations are protected from mmu notifier
operations. During TD runtime, zapping of private memory will only be
from memslot deletion and from conversion between private and shared
memory which is triggered by the guest.

For 2, KVM needs to be taught which operations will operate on which
aliases. An enum based scheme is introduced such that operations can
target specific aliases like:
Memslot deletion           - Private and shared
MMU notifier based zapping - Shared only
Conversion to shared       - Private only
Conversion to private      - Shared only
MTRRs, etc                 - Zapping will be avoided altogether

For zapping arising from other virtualization based operations, there
are four scenarios:
1. MTRR update
2. CR0.CD update
3. APICv update
4. Non-coherent DMA status update

KVM TDX will not support 1-3; future changes (after this series) will
make these features unsupported for TDX. For 4, there isn’t an easy way
to not support the feature as the notification is just passed to KVM and
it has to act accordingly. However, other proposed changes [3] will
avoid the need for zapping on non-coherent DMA notification for
selfsnoop CPUs. So KVM can follow this logic and just always honor guest
PAT for shared memory. See more details in patch 8.

Prevention of zapping mid-level PTEs
------------------------------------
As mentioned earlier, private PTEs (and so also mirrored PTEs) need to
be zapped at the leafs only.
This means that for TDX, the fast zap roots optimization for memslot
deletion is not compatible, and instead only the leafs should be zapped.

Behavior like this for memslot deletion was tried [4] once before for
normal VMs, and unfortunately it exposed a mysterious bug affecting an
nVidia GPU in a Windows guest that was never root caused. Since the
restriction on not zapping roots is only for private memory, TDX could
minimize the possibility of being exposed to this by always zapping
shared roots, and zapping leafs only for the private alias. However,
designing long term ABI around a bug seems wrong. So instead, this
series explores creating a new memslot flag that allows for specifying
that a memslot should be deleted without zapping other GPA ranges. The
expectation would be for userspace to set this on memslots used for TDX.
Controlling this behavior at the VM level was also explored. See patch 2
for more information.

Atomically updating private EPT
-------------------------------
Although this prep series does not interact with the TDX module at all
to actually configure the private EPT, it does lay the groundwork for
doing this. In some ways updating the private EPT is as simple as
plumbing PTE modifications through to also call into the TDX module, but
there is one tricky property that is worth elaborating on. That is how
to handle the fact that the TDP MMU allows modification of PTEs with the
mmu_lock held only for read, and uses the PTEs themselves to perform
synchronization.

Unfortunately, while operating on a single PTE can be done atomically,
operating on both the mirrored and private PTEs at the same time needs
an additional solution. To handle this situation, REMOVED_SPTE is used
to prevent concurrent operations while a call to the TDX module updates
the private EPT. For more information see the documentation in patch 18.
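The general shape of using a sentinel PTE value to serialize a two-part
update can be sketched in userspace C11 as below. This is only an
illustration of the pattern, not the series' implementation: the
sentinel value, the function names, the error codes and the callback
standing in for the TDX module call are all hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for REMOVED_SPTE: marks an entry as in flux. */
#define SKETCH_FROZEN_SPTE UINT64_MAX

/* Hypothetical stand-in for the slow call that updates the private EPT. */
typedef int (*sketch_private_update_fn)(uint64_t new_spte);

static inline int sketch_private_ok(uint64_t new_spte)
{
	(void)new_spte;
	return 0;
}

static inline int sketch_private_fail(uint64_t new_spte)
{
	(void)new_spte;
	return -EIO;
}

/*
 * Update a mirrored entry and its private counterpart as one unit.
 * Returns 0 on success, -EBUSY if another updater won the race (the
 * caller is expected to retry), or the error from the private update.
 */
static inline int sketch_set_mirrored_spte(_Atomic uint64_t *sptep,
					   uint64_t old_spte,
					   uint64_t new_spte,
					   sketch_private_update_fn update)
{
	uint64_t expected = old_spte;
	int ret;

	/* A frozen entry is owned by whoever froze it. */
	if (old_spte == SKETCH_FROZEN_SPTE)
		return -EBUSY;

	/* Step 1: atomically swap in the sentinel to lock out others. */
	if (!atomic_compare_exchange_strong(sptep, &expected,
					    SKETCH_FROZEN_SPTE))
		return -EBUSY;

	/* Step 2: do the slow private update while the entry is frozen. */
	ret = update(new_spte);
	if (ret) {
		/* Roll back so the mirror still matches the private side. */
		atomic_store(sptep, old_spte);
		return ret;
	}

	/* Step 3: publish the new value, which also unfreezes the entry. */
	atomic_store(sptep, new_spte);
	return 0;
}
```

A concurrent updater that read the old value fails its compare-exchange
in step 1 and retries, so the mirrored entry and the private entry are
never observed half-updated by another writer in this sketch.
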
For more detailed discussion see the "The original TDP MMU and race
condition" section of documentation patch [5].

The series is based on kvm-coco-queue.

[0] https://lore.kernel.org/kvm/cover.1708933498.git.isaku.yamahata@xxxxxxxxx/
[1] https://github.com/intel/tdx/tree/tdx_kvm_dev-2024-05-14-mmu-prep-1
[2] https://lore.kernel.org/kvm/20231218072247.2573516-1-qian.wen@xxxxxxxxx/
[3] https://lore.kernel.org/kvm/20240309010929.1403984-6-seanjc@xxxxxxxxxx/
[4] https://lore.kernel.org/kvm/20200703025047.13987-1-sean.j.christopherson@xxxxxxxxx/
[5] https://github.com/intel/tdx/commit/70cd3c807e547854ea52f56623ce168c7869679e

Isaku Yamahata (11):
  KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU
  KVM: x86/mmu: Add address conversion functions for TDX shared bit of
    GPA
  KVM: Add member to struct kvm_gfn_range for target alias
  KVM: x86/mmu: Add a new is_private member for union kvm_mmu_page_role
  KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page
  KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
  KVM: x86/tdp_mmu: Extract root invalid check from tdx_mmu_next_root()
  KVM: x86/tdp_mmu: Introduce KVM MMU root types to specify page table
    type
  KVM: x86/tdp_mmu: Introduce shared, private KVM MMU root types
  KVM: x86/tdp_mmu: Take root types for
    kvm_tdp_mmu_invalidate_all_roots()
  KVM: x86/tdp_mmu: Make mmu notifier callbacks to check kvm_process

Rick Edgecombe (3):
  KVM: x86: Add a VM type define for TDX
  KVM: x86/mmu: Bug the VM if kvm_zap_gfn_range() is called for TDX
  KVM: x86/mmu: Make kvm_tdp_mmu_alloc_root() return void

Sean Christopherson (1):
  KVM: x86/tdp_mmu: Invalidate correct roots

Yan Zhao (1):
  KVM: x86/mmu: Introduce a slot flag to zap only slot leafs on slot
    deletion

 arch/x86/include/asm/kvm-x86-ops.h |   5 +
 arch/x86/include/asm/kvm_host.h    |  45 +++-
 arch/x86/include/uapi/asm/kvm.h    |   1 +
 arch/x86/kvm/mmu.h                 |  36 +++
 arch/x86/kvm/mmu/mmu.c             |  86 +++++-
 arch/x86/kvm/mmu/mmu_internal.h    |  60 ++++-
 arch/x86/kvm/mmu/spte.h            |   5 +
 arch/x86/kvm/mmu/tdp_iter.h        |   2 +-
 arch/x86/kvm/mmu/tdp_mmu.c         | 407 ++++++++++++++++++++++++-----
 arch/x86/kvm/mmu/tdp_mmu.h         |  18 +-
 arch/x86/kvm/x86.c                 |  17 ++
 include/linux/kvm_host.h           |   8 +
 include/uapi/linux/kvm.h           |   1 +
 virt/kvm/guest_memfd.c             |   2 +
 virt/kvm/kvm_main.c                |  19 +-
 15 files changed, 632 insertions(+), 80 deletions(-)

base-commit: 698ca1e403579ca00e16a5b28ae4d576d9f1b20e
-- 
2.34.1