On Thu, Mar 13, 2025 at 7:16 PM Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
> Improve TDX shutdown performance by adding a more efficient shutdown
> operation at the cost of adding separate branches for the TDX MMU
> operations for normal runtime and shutdown. This more efficient method was
> previously used in earlier versions of the TDX patches, but was removed to
> simplify the initial upstreaming. This is an RFC, and still needs a proper
> upstream commit log. It is intended to be an eventual follow up to base
> support.

In the latest code the HKID is released in kvm_arch_pre_destroy_vm().
That is before kvm_free_memslot() calls kvm_gmem_unbind(), which
results in fput() and hence kvm_gmem_release().

So, as long as userspace doesn't remove the memslots and close the
guest_memfds, shouldn't the TD_TEARDOWN method be usable?

Paolo

> == Background ==
>
> TDX has 2 methods for the host to reclaim guest private memory, depending
> on whether the TD (TDX VM) is in a runnable state or not. These are
> called, respectively:
> 1. Dynamic Page Removal
> 2. Reclaiming TD Pages in TD_TEARDOWN State
>
> Dynamic Page Removal is much slower. Reclaiming a 4K page in TD_TEARDOWN
> state can be 5 times faster, although that is just one part of TD shutdown.
>
> == Relevant TDX Architecture ==
>
> Dynamic Page Removal is slow because it has to potentially deal with a
> running TD, and so involves a number of steps:
>         Block further address translation
>         Exit each VCPU
>         Clear Secure EPT entry
>         Flush/write-back/invalidate relevant caches
>
> Reclaiming TD Pages in TD_TEARDOWN State is fast because the shutdown
> procedure (refer tdx_mmu_release_hkid()) has already performed the relevant
> flushing. For details, see TDX Module Base Spec October 2024 sections:
>
>         7.5. TD Teardown Sequence
>         5.6.3. TD Keys Reclamation, TLB and Cache Flush
>
> Essentially all that remains then is to take each page away from the
> TDX Module and return it to the kernel.
>
> == Problem ==
>
> Currently, Dynamic Page Removal is being used when the TD is being
> shutdown for the sake of having simpler initial code.
>
> This happens when guest_memfds are closed, refer kvm_gmem_release().
> guest_memfds hold a reference to struct kvm, so that VM destruction cannot
> happen until after they are released, refer kvm_gmem_release().
>
> Reclaiming TD Pages in TD_TEARDOWN State was seen to decrease the total
> reclaim time. For example:
>
>         VCPUs   Size (GB)   Before (secs)   After (secs)
>           4        18             72             24
>          32       107            517            134
>
> Note, the V19 patch set:
>
>         https://lore.kernel.org/all/cover.1708933498.git.isaku.yamahata@xxxxxxxxx/
>
> did not have this issue because the HKID was released early, something that
> Sean effectively NAK'ed:
>
>         "No, the right answer is to not release the HKID until the VM is
>         destroyed."
>
>         https://lore.kernel.org/all/ZN+1QHGa6ltpQxZn@xxxxxxxxxx/
>
> That was taken on board in the "TDX MMU Part 2" patch set. Refer
> "Moving of tdx_mmu_release_hkid()" in:
>
>         https://lore.kernel.org/kvm/20240904030751.117579-1-rick.p.edgecombe@xxxxxxxxx/
>
> == Options ==
>
> 1. Start TD teardown earlier so that when pages are removed,
>    they can be reclaimed faster.
> 2. Defer page removal until after TD teardown has started.
> 3. A combination of 1 and 2.
>
> Option 1 is problematic because it means putting the TD into a non-runnable
> state while it is potentially still active. Also, as mentioned above, Sean
> effectively NAK'ed it.
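A minimal sketch of the distinction drawn above between the two reclaim
paths, using names that appear in the patch below.  The wrapper function
name is made up purely for illustration, and the real removal path has
more steps and error handling than this:

        /*
         * Illustrative only: once tdx_mmu_release_hkid() has put the TD into
         * TD_TEARDOWN state, no per-page flushing is needed, so reclaim can
         * skip the block/exit-VCPUs/flush sequence of Dynamic Page Removal.
         */
        static int reclaim_private_page(struct kvm *kvm, gfn_t gfn,
                                        enum pg_level level, struct page *page)
        {
                if (!is_hkid_assigned(to_kvm_tdx(kvm)))
                        /* TD_TEARDOWN: just take the page back from the TDX Module. */
                        return tdx_reclaim_page(page);

                /* TD potentially still runnable: Dynamic Page Removal. */
                return tdx_sept_zap_private_spte(kvm, gfn, level, page);
        }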
>
> Option 2 is possible because the lifetime of guest memory pages is separate
> from guest_memfd (struct kvm_gmem) lifetime.
>
> A reference is taken to pages when they are faulted in, refer
> kvm_gmem_get_pfn(). That reference is not released until the pages are
> removed from the mirror SEPT, refer tdx_unpin().
>
> Option 3 is not needed because TD teardown begins during VM destruction
> before pages are reclaimed. TD_TEARDOWN state is entered by
> tdx_mmu_release_hkid(), whereas pages are reclaimed by tdp_mmu_zap_root(),
> as follows:
>
>         kvm_arch_destroy_vm()
>             ...
>             vt_vm_pre_destroy()
>                 tdx_mmu_release_hkid()
>             ...
>             kvm_mmu_uninit_vm()
>                 kvm_mmu_uninit_tdp_mmu()
>                     kvm_tdp_mmu_invalidate_roots()
>                     kvm_tdp_mmu_zap_invalidated_roots()
>                         tdp_mmu_zap_root()
>
> == Proof of Concept for option 2 ==
>
> Assume user space never needs to close a guest_memfd except as part of VM
> shutdown.
>
> Add a callback from kvm_gmem_release() to decide whether to defer removal.
> For TDX, record the inode (up to a max. of 64 inodes) and pin it.
>
> Amend the release of guest_memfds to skip removing pages from the MMU
> in that case.
>
> Amend TDX private memory page removal to detect TD_TEARDOWN state, and
> reclaim the page accordingly.
>
> For TDX, finally unpin any pinned inodes.
>
> This hopefully illustrates what needs to be done, but guidance is sought
> for the best way to do it.
>
> Signed-off-by: Adrian Hunter <adrian.hunter@xxxxxxxxx>
> ---
>  arch/x86/include/asm/kvm-x86-ops.h |  1 +
>  arch/x86/include/asm/kvm_host.h    |  3 ++
>  arch/x86/kvm/Kconfig               |  1 +
>  arch/x86/kvm/vmx/main.c            | 12 +++++++-
>  arch/x86/kvm/vmx/tdx.c             | 47 +++++++++++++++++++++++++-----
>  arch/x86/kvm/vmx/tdx.h             | 14 +++++++++
>  arch/x86/kvm/vmx/x86_ops.h         |  2 ++
>  arch/x86/kvm/x86.c                 |  7 +++++
>  include/linux/kvm_host.h           |  5 ++++
>  virt/kvm/Kconfig                   |  4 +++
>  virt/kvm/guest_memfd.c             | 26 ++++++++++++-----
>  11 files changed, 107 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 79406bf07a1c..e4728f1fe646 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -148,6 +148,7 @@ KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
>  KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
>  KVM_X86_OP_OPTIONAL_RET0(private_max_mapping_level)
>  KVM_X86_OP_OPTIONAL(gmem_invalidate)
> +KVM_X86_OP_OPTIONAL_RET0(gmem_defer_removal)
>
>  #undef KVM_X86_OP
>  #undef KVM_X86_OP_OPTIONAL
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 9b9dde476f3c..d1afb4e1c2ee 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1661,6 +1661,8 @@ static inline u16 kvm_lapic_irq_dest_mode(bool dest_mode_logical)
>          return dest_mode_logical ? APIC_DEST_LOGICAL : APIC_DEST_PHYSICAL;
>  }
>
> +struct inode;
> +
>  struct kvm_x86_ops {
>          const char *name;
>
> @@ -1888,6 +1890,7 @@ struct kvm_x86_ops {
>          int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
>          void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
>          int (*private_max_mapping_level)(struct kvm *kvm, kvm_pfn_t pfn);
> +        int (*gmem_defer_removal)(struct kvm *kvm, struct inode *inode);
>  };
>
>  struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 0d445a317f61..32c4b9922e7b 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -96,6 +96,7 @@ config KVM_INTEL
>          depends on KVM && IA32_FEAT_CTL
>          select KVM_GENERIC_PRIVATE_MEM if INTEL_TDX_HOST
>          select KVM_GENERIC_MEMORY_ATTRIBUTES if INTEL_TDX_HOST
> +        select HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL if INTEL_TDX_HOST
>          help
>            Provides support for KVM on processors equipped with Intel's VT
>            extensions, a.k.a. Virtual Machine Extensions (VMX).
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 94d5d907d37b..b835006e1282 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -888,6 +888,14 @@ static int vt_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
>          return 0;
>  }
>
> +static int vt_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
> +{
> +        if (is_td(kvm))
> +                return tdx_gmem_defer_removal(kvm, inode);
> +
> +        return 0;
> +}
> +
>  #define VMX_REQUIRED_APICV_INHIBITS \
>          (BIT(APICV_INHIBIT_REASON_DISABLED) | \
>           BIT(APICV_INHIBIT_REASON_ABSENT) | \
> @@ -1046,7 +1054,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>          .mem_enc_ioctl = vt_mem_enc_ioctl,
>          .vcpu_mem_enc_ioctl = vt_vcpu_mem_enc_ioctl,
>
> -        .private_max_mapping_level = vt_gmem_private_max_mapping_level
> +        .private_max_mapping_level = vt_gmem_private_max_mapping_level,
> +
> +        .gmem_defer_removal = vt_gmem_defer_removal,
>  };
>
>  struct kvm_x86_init_ops vt_init_ops __initdata = {
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index d9eb20516c71..51bbb44ac1bd 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -5,6 +5,7 @@
>  #include <asm/fpu/xcr.h>
>  #include <linux/misc_cgroup.h>
>  #include <linux/mmu_context.h>
> +#include <linux/fs.h>
>  #include <asm/tdx.h>
>  #include "capabilities.h"
>  #include "mmu.h"
> @@ -594,10 +595,20 @@ static void tdx_reclaim_td_control_pages(struct kvm *kvm)
>          kvm_tdx->td.tdr_page = NULL;
>  }
>
> +static void tdx_unpin_inodes(struct kvm *kvm)
> +{
> +        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +
> +        for (int i = 0; i < kvm_tdx->nr_gmem_inodes; i++)
> +                iput(kvm_tdx->gmem_inodes[i]);
> +}
> +
>  void tdx_vm_destroy(struct kvm *kvm)
>  {
>          struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>
> +        tdx_unpin_inodes(kvm);
> +
>          tdx_reclaim_td_control_pages(kvm);
>
>          kvm_tdx->state = TD_STATE_UNINITIALIZED;
> @@ -1778,19 +1789,28 @@ int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
>          return tdx_reclaim_page(virt_to_page(private_spt));
>  }
>
> +static int tdx_sept_teardown_private_spte(struct kvm *kvm, enum pg_level level, struct page *page)
> +{
> +        int ret;
> +
> +        if (level != PG_LEVEL_4K)
> +                return -EINVAL;
> +
> +        ret = tdx_reclaim_page(page);
> +        if (!ret)
> +                tdx_unpin(kvm, page);
> +
> +        return ret;
> +}
> +
>  int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
>                                   enum pg_level level, kvm_pfn_t pfn)
>  {
>          struct page *page = pfn_to_page(pfn);
>          int ret;
>
> -        /*
> -         * HKID is released after all private pages have been removed, and set
> -         * before any might be populated. Warn if zapping is attempted when
> -         * there can't be anything populated in the private EPT.
> -         */
> -        if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
> -                return -EINVAL;
> +        if (!is_hkid_assigned(to_kvm_tdx(kvm)))
> +                return tdx_sept_teardown_private_spte(kvm, level, pfn_to_page(pfn));
>
>          ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
>          if (ret <= 0)
> @@ -3221,6 +3241,19 @@ int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
>          return PG_LEVEL_4K;
>  }
>
> +int tdx_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
> +{
> +        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +
> +        if (kvm_tdx->nr_gmem_inodes >= TDX_MAX_GMEM_INODES)
> +                return 0;
> +
> +        kvm_tdx->gmem_inodes[kvm_tdx->nr_gmem_inodes++] = inode;
> +        ihold(inode);
> +
> +        return 1;
> +}
> +
>  static int tdx_online_cpu(unsigned int cpu)
>  {
>          unsigned long flags;
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 6b3bebebabfa..fb5c4face131 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -20,6 +20,10 @@ enum kvm_tdx_state {
>          TD_STATE_RUNNABLE,
>  };
>
> +struct inode;
> +
> +#define TDX_MAX_GMEM_INODES 64
> +
>  struct kvm_tdx {
>          struct kvm kvm;
>
> @@ -43,6 +47,16 @@ struct kvm_tdx {
>           * Set/unset is protected with kvm->mmu_lock.
>           */
>          bool wait_for_sept_zap;
> +
> +        /*
> +         * For pages that will not be removed until TD shutdown, the associated
> +         * guest_memfd inode is pinned. Allow for a fixed number of pinned
> +         * inodes. If there are more, then when the guest_memfd is closed,
> +         * their pages will be removed safely but inefficiently prior to
> +         * shutdown.
> +         */
> +        struct inode *gmem_inodes[TDX_MAX_GMEM_INODES];
> +        int nr_gmem_inodes;
>  };
>
>  /* TDX module vCPU states */
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 4704bed033b1..4ee123289d85 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -164,6 +164,7 @@ void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
>  void tdx_flush_tlb_all(struct kvm_vcpu *vcpu);
>  void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
>  int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn);
> +int tdx_gmem_defer_removal(struct kvm *kvm, struct inode *inode);
>  #else
>  static inline void tdx_disable_virtualization_cpu(void) {}
>  static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
> @@ -229,6 +230,7 @@ static inline void tdx_flush_tlb_current(struct kvm_vcpu *vcpu) {}
>  static inline void tdx_flush_tlb_all(struct kvm_vcpu *vcpu) {}
>  static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
>  static inline int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn) { return 0; }
> +static inline int tdx_gmem_defer_removal(struct kvm *kvm, struct inode *inode) { return 0; }
>  #endif
>
>  #endif /* __KVM_X86_VMX_X86_OPS_H */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 03db366e794a..96ebf0303223 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13659,6 +13659,13 @@ void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
>  }
>  #endif
>
> +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL
> +bool kvm_arch_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
> +{
> +        return kvm_x86_call(gmem_defer_removal)(kvm, inode);
> +}
> +#endif
> +
>  int kvm_spec_ctrl_test_value(u64 value)
>  {
>          /*
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d72ba0a4ca0e..f5c8b1923c24 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2534,6 +2534,11 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
>  int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order);
>  #endif
>
> +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL
> +struct inode;
> +bool kvm_arch_gmem_defer_removal(struct kvm *kvm, struct inode *inode);
> +#endif
> +
>  #ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
>  /**
>   * kvm_gmem_populate() - Populate/prepare a GPA range with guest data
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 54e959e7d68f..589111505ad0 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -124,3 +124,7 @@ config HAVE_KVM_ARCH_GMEM_PREPARE
>  config HAVE_KVM_ARCH_GMEM_INVALIDATE
>          bool
>          depends on KVM_PRIVATE_MEM
> +
> +config HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL
> +        bool
> +        depends on KVM_PRIVATE_MEM
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index b2aa6bf24d3a..cd485f45fdaf 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -248,6 +248,15 @@ static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
>          return ret;
>  }
>
> +inline bool kvm_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
> +{
> +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL
> +        return kvm_arch_gmem_defer_removal(kvm, inode);
> +#else
> +        return false;
> +#endif
> +}
> +
>  static int kvm_gmem_release(struct inode *inode, struct file *file)
>  {
>          struct kvm_gmem *gmem = file->private_data;
> @@ -275,13 +284,16 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
>          xa_for_each(&gmem->bindings, index, slot)
>                  WRITE_ONCE(slot->gmem.file, NULL);
>
> -        /*
> -         * All in-flight operations are gone and new bindings can be created.
> -         * Zap all SPTEs pointed at by this file. Do not free the backing
> -         * memory, as its lifetime is associated with the inode, not the file.
> -         */
> -        kvm_gmem_invalidate_begin(gmem, 0, -1ul);
> -        kvm_gmem_invalidate_end(gmem, 0, -1ul);
> +        if (!kvm_gmem_defer_removal(kvm, inode)) {
> +                /*
> +                 * All in-flight operations are gone and new bindings can be
> +                 * created. Zap all SPTEs pointed at by this file. Do not free
> +                 * the backing memory, as its lifetime is associated with the
> +                 * inode, not the file.
> +                 */
> +                kvm_gmem_invalidate_begin(gmem, 0, -1ul);
> +                kvm_gmem_invalidate_end(gmem, 0, -1ul);
> +        }
>
>          list_del(&gmem->entry);
>
> --
> 2.43.0
>
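For reference, a sketch of the teardown ordering relied on above, pieced
together from the descriptions in this thread (abridged and illustrative,
not an exact call chain):

        kvm_destroy_vm()                        last reference to struct kvm is dropped
            kvm_arch_pre_destroy_vm()
                tdx_mmu_release_hkid()          TD enters TD_TEARDOWN, caches flushed
            kvm_arch_destroy_vm()
                ...
                tdp_mmu_zap_root()              private pages reclaimed via the fast path
            kvm_free_memslot()
                kvm_gmem_unbind()
                    fput()                      possibly the last guest_memfd reference
                        kvm_gmem_release()

If userspace instead closes its guest_memfds while the VM is still alive,
kvm_gmem_release() is reached before tdx_mmu_release_hkid() and only
Dynamic Page Removal is available -- which is the case the proof of
concept above defers.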