On Fri, Feb 25, 2022 at 4:16 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> Zap defunct roots, a.k.a. roots that have been invalidated after their
> last reference was initially dropped, asynchronously via the system work
> queue instead of forcing the work upon the unfortunate task that happened
> to drop the last reference.
>
> If a vCPU task drops the last reference, the vCPU is effectively blocked
> by the host for the entire duration of the zap. If the root being zapped
> happens to be fully populated with 4kb leaf SPTEs, e.g. due to dirty
> logging being active, the zap can take several hundred seconds.
> Unsurprisingly, most guests are unhappy if a vCPU disappears for
> hundreds of seconds.
>
> E.g. running a synthetic selftest that triggers a vCPU root zap with
> ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
> Offloading the zap to a worker drops the block time to <100ms.
>
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>

Reviewed-by: Ben Gardon <bgardon@xxxxxxxxxx>

> ---
>  arch/x86/kvm/mmu/mmu_internal.h |  8 +++-
>  arch/x86/kvm/mmu/tdp_mmu.c      | 65 ++++++++++++++++++++++++++++-----
>  2 files changed, 63 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index be063b6c91b7..1bff453f7cbe 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -65,7 +65,13 @@ struct kvm_mmu_page {
>  		struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
>  		tdp_ptep_t ptep;
>  	};
> -	DECLARE_BITMAP(unsync_child_bitmap, 512);
> +	union {
> +		DECLARE_BITMAP(unsync_child_bitmap, 512);
> +		struct {
> +			struct work_struct tdp_mmu_async_work;
> +			void *tdp_mmu_async_data;
> +		};
> +	};

At some point (probably not in this series since it's so long already)
it would be good to organize kvm_mmu_page. It looks like we've got
quite a few anonymous unions in there for TDP / Shadow MMU fields.

>
>  	struct list_head lpage_disallowed_link;
>  #ifdef CONFIG_X86_32
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index ec28a88c6376..4151e61245a7 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -81,6 +81,38 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
>  static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  			     bool shared);
>
> +static void tdp_mmu_zap_root_async(struct work_struct *work)
> +{
> +	struct kvm_mmu_page *root = container_of(work, struct kvm_mmu_page,
> +						 tdp_mmu_async_work);
> +	struct kvm *kvm = root->tdp_mmu_async_data;
> +
> +	read_lock(&kvm->mmu_lock);
> +
> +	/*
> +	 * A TLB flush is not necessary as KVM performs a local TLB flush when
> +	 * allocating a new root (see kvm_mmu_load()), and when migrating a
> +	 * vCPU to a different pCPU. Note, the local TLB flush on reuse also
> +	 * invalidates any paging-structure-cache entries, i.e. TLB entries for
> +	 * intermediate paging structures, that may be zapped, as such entries
> +	 * are associated with the ASID on both VMX and SVM.
> +	 */
> +	tdp_mmu_zap_root(kvm, root, true);
> +
> +	/*
> +	 * Drop the refcount using kvm_tdp_mmu_put_root() to test its logic for
> +	 * avoiding an infinite loop. By design, the root is reachable while
> +	 * it's being asynchronously zapped, thus a different task can put its
> +	 * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for an
> +	 * asynchronously zapped root is unavoidable.
> +	 */
> +	kvm_tdp_mmu_put_root(kvm, root, true);
> +
> +	read_unlock(&kvm->mmu_lock);
> +
> +	kvm_put_kvm(kvm);
> +}
> +
>  void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  			  bool shared)
>  {
> @@ -142,15 +174,26 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  	refcount_set(&root->tdp_mmu_root_count, 1);
>
>  	/*
> -	 * Zap the root, then put the refcount "acquired" above. Recursively
> -	 * call kvm_tdp_mmu_put_root() to test the above logic for avoiding an
> -	 * infinite loop by freeing invalid roots. By design, the root is
> -	 * reachable while it's being zapped, thus a different task can put its
> -	 * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for a
> -	 * defunct root is unavoidable.
> +	 * Attempt to acquire a reference to KVM itself. If KVM is alive, then
> +	 * zap the root asynchronously in a worker, otherwise it must be zapped
> +	 * directly here. Wait to do this check until after the refcount is
> +	 * reset so that tdp_mmu_zap_root() can safely yield.
> +	 *
> +	 * In both flows, zap the root, then put the refcount "acquired" above.
> +	 * When putting the reference, use kvm_tdp_mmu_put_root() to test the
> +	 * above logic for avoiding an infinite loop by freeing invalid roots.
> +	 * By design, the root is reachable while it's being zapped, thus a
> +	 * different task can put its last reference, i.e. flowing through
> +	 * kvm_tdp_mmu_put_root() for a defunct root is unavoidable.
>  	 */
> -	tdp_mmu_zap_root(kvm, root, shared);
> -	kvm_tdp_mmu_put_root(kvm, root, shared);
> +	if (kvm_get_kvm_safe(kvm)) {
> +		root->tdp_mmu_async_data = kvm;
> +		INIT_WORK(&root->tdp_mmu_async_work, tdp_mmu_zap_root_async);
> +		schedule_work(&root->tdp_mmu_async_work);
> +	} else {
> +		tdp_mmu_zap_root(kvm, root, shared);
> +		kvm_tdp_mmu_put_root(kvm, root, shared);
> +	}
>  }
>
>  enum tdp_mmu_roots_iter_type {
> @@ -954,7 +997,11 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
>
>  	/*
>  	 * Zap all roots, including invalid roots, as all SPTEs must be dropped
> -	 * before returning to the caller.
> +	 * before returning to the caller. Zap directly even if the root is
> +	 * also being zapped by a worker. Walking zapped top-level SPTEs isn't
> +	 * all that expensive and mmu_lock is already held, which means the
> +	 * worker has yielded, i.e. flushing the work instead of zapping here
> +	 * isn't guaranteed to be any faster.
>  	 *
>  	 * A TLB flush is unnecessary, KVM zaps everything if and only if the
>  	 * VM is being destroyed or the userspace VMM has exited. In both cases,
> --
> 2.35.1.574.g5d30c73bfb-goog
>
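
A side note for readers who haven't bumped into this deferred-teardown
idiom before: the shape of the change is independent of KVM, so below is a
minimal, self-contained sketch of the same pattern. All of the my_* names
are hypothetical stand-ins invented for illustration; only container_of(),
kref_get_unless_zero(), kref_put(), INIT_WORK() and schedule_work() are
real kernel APIs.

#include <linux/kref.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

/* Hypothetical owner object, standing in for struct kvm. */
struct my_vm {
	struct kref users;
};

/* Hypothetical per-owner object, standing in for struct kvm_mmu_page. */
struct my_root {
	struct my_vm *vm;
	struct work_struct teardown_work;
};

/* Stand-in for the expensive part, e.g. walking and dropping SPTEs. */
static void my_root_teardown(struct my_root *root)
{
	kfree(root);
}

static void my_vm_release(struct kref *kref)
{
	kfree(container_of(kref, struct my_vm, users));
}

static void my_teardown_worker(struct work_struct *work)
{
	/* Recover the object from its embedded work_struct. */
	struct my_root *root = container_of(work, struct my_root,
					    teardown_work);
	struct my_vm *vm = root->vm;

	my_root_teardown(root);			/* slow, but off the hot path */
	kref_put(&vm->users, my_vm_release);	/* ref taken at schedule time */
}

static void my_root_put_last_ref(struct my_root *root)
{
	/*
	 * Pin the owner before scheduling; this only succeeds if the VM
	 * still has users, so the worker can never run against a VM whose
	 * destruction is already underway.
	 */
	if (kref_get_unless_zero(&root->vm->users)) {
		INIT_WORK(&root->teardown_work, my_teardown_worker);
		schedule_work(&root->teardown_work);	/* system workqueue */
	} else {
		/* VM is dying: no worker available, tear down in place. */
		my_root_teardown(root);
	}
}

The kref_get_unless_zero() check plays roughly the role kvm_get_kvm_safe()
plays in the patch: work is only deferred when the owner is guaranteed to
stay alive until the worker drops its reference, and a dying owner falls
back to synchronous teardown, mirroring the if/else added to
kvm_tdp_mmu_put_root().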