On Sat, Jun 22, 2024 at 03:08:22AM +0800, Edgecombe, Rick P wrote:
> On Fri, 2024-06-21 at 15:10 +0800, Yan Zhao wrote:
> > On Wed, Jun 19, 2024 at 03:36:14PM -0700, Rick Edgecombe wrote:
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 630e6b6d4bf2..a1ab67a4f41f 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -37,7 +37,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> > >          * for zapping and thus puts the TDP MMU's reference to each root, i.e.
> > >          * ultimately frees all roots.
> > >          */
> > > -       kvm_tdp_mmu_invalidate_all_roots(kvm);
> > > +       kvm_tdp_mmu_invalidate_roots(kvm, KVM_VALID_ROOTS);
> > all roots (mirror + direct) are invalidated here.
> 
> Right.
> 
> > >         kvm_tdp_mmu_zap_invalidated_roots(kvm);
> > kvm_tdp_mmu_zap_invalidated_roots() will zap invalidated mirror roots with
> > mmu_lock held for read, which should trigger the KVM_BUG_ON() in
> > __tdp_mmu_set_spte_atomic(), which assumes "atomic zapping doesn't operate
> > on mirror roots".
> > 
> > But up to now, the KVM_BUG_ON() is not triggered because
> > kvm_mmu_notifier_release() is called earlier than kvm_destroy_vm() (as in
> > the call trace below), and kvm_arch_flush_shadow_all() in
> > kvm_mmu_notifier_release() has already zapped all mirror SPTEs before
> > kvm_mmu_uninit_vm() is called in kvm_destroy_vm().
> > 
> > kvm_mmu_notifier_release
> >   kvm_flush_shadow_all
> >     kvm_arch_flush_shadow_all
> >       static_call_cond(kvm_x86_flush_shadow_all_private)(kvm);
> >       kvm_mmu_zap_all                       ==> hold mmu_lock for write
> >         kvm_tdp_mmu_zap_all                 ==> zap KVM_ALL_ROOTS with mmu_lock held for write
> > 
> > kvm_destroy_vm
> >   kvm_arch_destroy_vm
> >     kvm_mmu_uninit_vm
> >       kvm_mmu_uninit_tdp_mmu
> >         kvm_tdp_mmu_invalidate_roots        ==> invalidate all KVM_VALID_ROOTS
> >         kvm_tdp_mmu_zap_invalidated_roots   ==> zap all roots with mmu_lock held for read
> > 
> > A question: why does kvm_mmu_notifier_release(), as a callback of the
> > primary MMU notifier, zap the mirrored TDP when all other callbacks
> > operate with KVM_FILTER_SHARED?
> > 
> > Could we just zap all KVM_DIRECT_ROOTS (valid | invalid) in
> > kvm_mmu_notifier_release() and move the mirrored TDP related work from
> > kvm_arch_flush_shadow_all() to kvm_mmu_uninit_tdp_mmu(), ensuring mmu_lock
> > is held for write?
> 
> Sigh, thanks for flagging this. I agree it seems weird to free private memory
> from an MMU notifier callback. I also found this old thread where Sean NAKed
> the current approach (free hkid in mmu release):
> https://lore.kernel.org/kvm/ZN+1QHGa6ltpQxZn@xxxxxxxxxx/#t
> 
> One challenge is that flush_shadow_all_private() needs to be done before
> kvm_destroy_vcpus(), where it gets into tdx_vcpu_free(). So kvm_mmu_uninit_vm()
> is too late. Perhaps this is why it was shoved into mmu notifier release (which
> happens long before, as you noted). Isaku, do you recall any other reasons?
> 
> But static_call_cond(kvm_x86_vm_destroy) happens before kvm_destroy_vcpus, so
> we could maybe actually just do the tdx_mmu_release_hkid() part there. Then
> drop the flush_shadow_all_private x86 op. See the (not thoroughly checked)
> diff at the bottom of this mail.
It looks good to me.

> But most of what is being discussed is in future patches where it starts to
> get into the TDX module interaction. So I wonder if we should drop this patch
> 17 from "part 1" and include it with the next series so it can all be
> considered together.
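Regarding the vm_destroy ordering above: for reference, the relevant part of
kvm_arch_destroy_vm() in arch/x86/kvm/x86.c is roughly as below (heavily
abbreviated, only the calls of interest to this discussion are shown, so
please double-check against the actual code):

        void kvm_arch_destroy_vm(struct kvm *kvm)
        {
                ...
                static_call_cond(kvm_x86_vm_destroy)(kvm);  /* tdx_mmu_release_hkid() would run here */
                ...
                kvm_destroy_vcpus(kvm);                     /* reaches tdx_vcpu_free() */
                ...
                kvm_mmu_uninit_vm(kvm);                     /* remaining roots torn down here */
                ...
        }

so the vm_destroy hook does run before the vCPUs are freed and before the
mirror roots are torn down.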
> 
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 2adf36b74910..3927731aa947 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -23,7 +23,6 @@ KVM_X86_OP(has_emulated_msr)
>  KVM_X86_OP(vcpu_after_set_cpuid)
>  KVM_X86_OP_OPTIONAL(vm_enable_cap)
>  KVM_X86_OP(vm_init)
> -KVM_X86_OP_OPTIONAL(flush_shadow_all_private)
>  KVM_X86_OP_OPTIONAL(vm_destroy)
>  KVM_X86_OP_OPTIONAL(vm_free)
>  KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 8a72e5873808..8b2b79b39d0f 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1647,7 +1647,6 @@ struct kvm_x86_ops {
>  	unsigned int vm_size;
>  	int (*vm_enable_cap)(struct kvm *kvm, struct kvm_enable_cap *cap);
>  	int (*vm_init)(struct kvm *kvm);
> -	void (*flush_shadow_all_private)(struct kvm *kvm);
>  	void (*vm_destroy)(struct kvm *kvm);
>  	void (*vm_free)(struct kvm *kvm);
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e1299eb03e63..4deeeac14324 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6446,7 +6446,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
>  	 * lead to use-after-free.
>  	 */
>  	if (tdp_mmu_enabled)
> -		kvm_tdp_mmu_zap_invalidated_roots(kvm);
> +		kvm_tdp_mmu_zap_invalidated_roots(kvm, true);
>  }
> 
>  static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> @@ -6977,13 +6977,6 @@ static void kvm_mmu_zap_all(struct kvm *kvm)
> 
>  void kvm_arch_flush_shadow_all(struct kvm *kvm)
>  {
> -	/*
> -	 * kvm_mmu_zap_all() zaps both private and shared page tables. Before
> -	 * tearing down private page tables, TDX requires some TD resources to
> -	 * be destroyed (i.e. keyID must have been reclaimed, etc). Invoke
> -	 * kvm_x86_flush_shadow_all_private() for this.
> -	 */
> -	static_call_cond(kvm_x86_flush_shadow_all_private)(kvm);
>  	kvm_mmu_zap_all(kvm);
>  }
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 68dfcdb46ab7..9e8b012aa8cc 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -38,7 +38,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
>  	 * ultimately frees all roots.
>  	 */
>  	kvm_tdp_mmu_invalidate_roots(kvm, KVM_VALID_ROOTS);
> -	kvm_tdp_mmu_zap_invalidated_roots(kvm);
> +	kvm_tdp_mmu_zap_invalidated_roots(kvm, false);
> 
>  	WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
>  	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
> @@ -1057,7 +1057,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
>  	 * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
>  	 */
>  	lockdep_assert_held_write(&kvm->mmu_lock);
> -	for_each_tdp_mmu_root_yield_safe(kvm, root)
> +	__for_each_tdp_mmu_root_yield_safe(kvm, root, -1, KVM_DIRECT_ROOTS)

nit: update the comment of kvm_tdp_mmu_zap_all() and explain why it's
KVM_DIRECT_ROOTS, not KVM_DIRECT_ROOTS | KVM_INVALID_ROOTS.

>  		tdp_mmu_zap_root(kvm, root, false);
>  }
> 
> @@ -1065,11 +1065,14 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
>   * Zap all invalidated roots to ensure all SPTEs are dropped before the "fast
>   * zap" completes.
>   */
> -void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> +void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared)
>  {
>  	struct kvm_mmu_page *root;
> 
> -	read_lock(&kvm->mmu_lock);
> +	if (shared)
> +		read_lock(&kvm->mmu_lock);
> +	else
> +		write_lock(&kvm->mmu_lock);
> 
>  	for_each_tdp_mmu_root_yield_safe(kvm, root) {
>  		if (!root->tdp_mmu_scheduled_root_to_zap)
> @@ -1087,7 +1090,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
>  		 * that may be zapped, as such entries are associated with the
>  		 * ASID on both VMX and SVM.
>  		 */
> -		tdp_mmu_zap_root(kvm, root, true);
> +		tdp_mmu_zap_root(kvm, root, shared);
> 
>  		/*
>  		 * The referenced needs to be put *after* zapping the root, as
> @@ -1097,7 +1100,10 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
>  		kvm_tdp_mmu_put_root(kvm, root);
>  	}
> 
> -	read_unlock(&kvm->mmu_lock);
> +	if (shared)
> +		read_unlock(&kvm->mmu_lock);
> +	else
> +		write_unlock(&kvm->mmu_lock);
>  }
> 
>  /*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 56741d31048a..7927fa4a96e0 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -68,7 +68,7 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
>  void kvm_tdp_mmu_zap_all(struct kvm *kvm);
>  void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
>  				  enum kvm_tdp_mmu_root_types root_types);
> -void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
> +void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared);
> 
>  int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index b6828e35eb17..3f9bfcd3e152 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -98,16 +98,12 @@ static int vt_vm_init(struct kvm *kvm)
>  	return vmx_vm_init(kvm);
>  }
> 
> -static void vt_flush_shadow_all_private(struct kvm *kvm)
> -{
> -	if (is_td(kvm))
> -		tdx_mmu_release_hkid(kvm);
> -}
> -
>  static void vt_vm_destroy(struct kvm *kvm)
>  {
> -	if (is_td(kvm))
> +	if (is_td(kvm)) {
> +		tdx_mmu_release_hkid(kvm);
>  		return;
> +	}
> 
>  	vmx_vm_destroy(kvm);
>  }
> @@ -980,7 +976,6 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>  	.vm_size = sizeof(struct kvm_vmx),
>  	.vm_enable_cap = vt_vm_enable_cap,
>  	.vm_init = vt_vm_init,
> -	.flush_shadow_all_private = vt_flush_shadow_all_private,
>  	.vm_destroy = vt_vm_destroy,
>  	.vm_free = vt_vm_free,
> 
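With kvm_mmu_uninit_tdp_mmu() passing shared=false, the mirror roots are torn
down via the non-atomic path, so the mirror-root assumption I mentioned above
is preserved. For reference, that assumption lives in
__tdp_mmu_set_spte_atomic(), roughly as below (quoting from memory of the
current series, abbreviated, so the exact shape and names may differ):

        if (is_mirror_sptep(iter->sptep) && ...) {
                /*
                 * Users of atomic zapping don't operate on mirror roots,
                 * so don't handle it and bug the VM if it's seen.
                 */
                if (KVM_BUG_ON(!is_shadow_present_pte(new_spte), kvm))
                        return -EBUSY;
                ...
        }

i.e. an atomic (mmu_lock held for read) zap of a mirror SPTE writes a
non-present new_spte and would hit the KVM_BUG_ON(), which is why the
read-lock path must never see mirror roots.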