From: Yan Zhao <yan.y.zhao@xxxxxxxxx> Introduce a per-memslot flag KVM_MEM_ZAP_LEAFS_ONLY to permit zap only leaf SPTEs when deleting a memslot. Today "zapping only memslot leaf SPTEs" on memslot deletion is not done. Instead KVM will invalidate all old TDPs (i.e. EPT for Intel or NPT for AMD) and generate fresh new TDPs based on the new memslot layout. This is because zapping and re-generating TDPs is low overhead for most use cases, and more importantly, it's due to a bug [1] which caused VM instability when a VM is with Nvidia Geforce GPU assigned. There's a previous attempt [2] to introduce a per-VM flag to workaround bug [1] by only allowing "zapping only memslot leaf SPTEs" for specific VMs. However, [2] was not merged due to lacking of a clear explanation of exactly what is broken [3] and it's not wise to "have a bug that is known to happen when you enable the capability". However, for some specific scenarios, e.g. TDX, invalidating and re-generating a new page table is not viable for reasons: - TDX requires root page of private page table remains unaltered throughout the TD life cycle. - TDX mandates that leaf entries in private page table must be zapped prior to non-leaf entries. So, Sean re-considered about introducing a per-VM flag or per-memslot flag again for VMs like TDX. [4] This patch is an implementation of per-memslot flag. Compared to per-VM flag approach, Pros: (1) By allowing userspace to control the zapping behavior in fine-grained granularity, optimizations for specific use cases can be developed without future kernel changes. (2) Allows developing new zapping behaviors without risking regressions by changing KVM behavior, as seen previously. Cons: (1) Users need to ensure all necessary memslots are with flag KVM_MEM_ZAP_LEAFS_ONLY set.e.g. QEMU needs to ensure all GUEST_MEMFD memslot is with ZAP_LEAFS_ONLY flag for TDX VM. (2) Opens up the possibility that userspace could configure memslots for normal VM in such a way that the bug [1] is seen. However, one thing deserves noting for TDX, is that TDX may potentially meet bug [1] for either per-memslot flag or per-VM flag approach, since there's a usage in radar to assign an untrusted & passthrough GPU device in TDX. If that happens, it can be treated as a bug (not regression) and fixed accordingly. An alternative approach we can also consider is to always invalidate & rebuild all shared page tables and zap only memslot leaf SPTEs for mirrored and private page tables on memslot deletion. This approach could exempt TDX from bug [1] when "untrusted & passthrough" devices are involved. But downside is that this approach requires creating new very specific KVM zapping ABI that could limit future changes in the same way that the bug did for normal VMs. Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@xxxxxxxxx [1] Link: https://lore.kernel.org/kvm/20200713190649.GE29725@xxxxxxxxxxxxxxx/T/#mabc0119583dacf621025e9d873c85f4fbaa66d5c [2] Link: https://lore.kernel.org/kvm/20200713190649.GE29725@xxxxxxxxxxxxxxx/T/#m1839c85392a7a022df9e507876bb241c022c4f06 [3] Link: https://lore.kernel.org/kvm/ZhSYEVCHqSOpVKMh@xxxxxxxxxx [4] Signed-off-by: Yan Zhao <yan.y.zhao@xxxxxxxxx> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@xxxxxxxxx> --- TDX MMU Part 1: - New patch --- arch/x86/kvm/mmu/mmu.c | 30 +++++++++++++++++++++++++++++- arch/x86/kvm/x86.c | 17 +++++++++++++++++ include/uapi/linux/kvm.h | 1 + virt/kvm/kvm_main.c | 5 ++++- 4 files changed, 51 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 61982da8c8b2..4a8e819794db 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -6962,10 +6962,38 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm) kvm_mmu_zap_all(kvm); } +static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *slot) +{ + if (KVM_BUG_ON(!tdp_mmu_enabled, kvm)) + return; + + write_lock(&kvm->mmu_lock); + + /* + * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst + * case scenario we'll have unused shadow pages lying around until they + * are recycled due to age or when the VM is destroyed. + */ + struct kvm_gfn_range range = { + .slot = slot, + .start = slot->base_gfn, + .end = slot->base_gfn + slot->npages, + .may_block = true, + }; + + if (kvm_tdp_mmu_unmap_gfn_range(kvm, &range, false)) + kvm_flush_remote_tlbs(kvm); + + write_unlock(&kvm->mmu_lock); +} + void kvm_arch_flush_shadow_memslot(struct kvm *kvm, struct kvm_memory_slot *slot) { - kvm_mmu_zap_all_fast(kvm); + if (slot->flags & KVM_MEM_ZAP_LEAFS_ONLY) + kvm_mmu_zap_memslot_leafs(kvm, slot); + else + kvm_mmu_zap_all_fast(kvm); } void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 7c593a081eba..4b3ec2ec79e9 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -12952,6 +12952,23 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm, if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn()) return -EINVAL; + /* + * Since TDX private pages requires re-accepting after zap, + * and TDX private root page should not be zapped, TDX requires + * memslots for private memory must have flag + * KVM_MEM_ZAP_LEAFS_ONLY set too, so that only leaf SPTEs of + * the deleting memslot will be zapped and SPTEs in other + * memslots would not be affected. + */ + if (kvm->arch.vm_type == KVM_X86_TDX_VM && + (new->flags & KVM_MEM_GUEST_MEMFD) && + !(new->flags & KVM_MEM_ZAP_LEAFS_ONLY)) + return -EINVAL; + + /* zap-leafs-only works only when TDP MMU is enabled for now */ + if ((new->flags & KVM_MEM_ZAP_LEAFS_ONLY) && !tdp_mmu_enabled) + return -EINVAL; + return kvm_alloc_memslot_metadata(kvm, new); } diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index aee67912e71c..d53648c19b26 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -51,6 +51,7 @@ struct kvm_userspace_memory_region2 { #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) #define KVM_MEM_READONLY (1UL << 1) #define KVM_MEM_GUEST_MEMFD (1UL << 2) +#define KVM_MEM_ZAP_LEAFS_ONLY (1UL << 3) /* for KVM_IRQ_LINE */ struct kvm_irq_level { diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 81b90bf03f2f..1b1ffb6fc786 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1568,6 +1568,8 @@ static int check_memory_region_flags(struct kvm *kvm, if (kvm_arch_has_private_mem(kvm)) valid_flags |= KVM_MEM_GUEST_MEMFD; + valid_flags |= KVM_MEM_ZAP_LEAFS_ONLY; + /* Dirty logging private memory is not currently supported. */ if (mem->flags & KVM_MEM_GUEST_MEMFD) valid_flags &= ~KVM_MEM_LOG_DIRTY_PAGES; @@ -2052,7 +2054,8 @@ int __kvm_set_memory_region(struct kvm *kvm, return -EINVAL; if ((mem->userspace_addr != old->userspace_addr) || (npages != old->npages) || - ((mem->flags ^ old->flags) & KVM_MEM_READONLY)) + ((mem->flags ^ old->flags) & + (KVM_MEM_READONLY | KVM_MEM_ZAP_LEAFS_ONLY))) return -EINVAL; if (base_gfn != old->base_gfn) -- 2.34.1