Gah, sorry Greg, forgot to say that this is for 6.1-stable.

On Thu, Oct 19, 2023, Sean Christopherson wrote:
> [ Upstream commit 0df9dab891ff0d9b646d82e4fe038229e4c02451 ]
>
> Stop zapping invalidated TDP MMU roots via work queue now that KVM
> preserves TDP MMU roots until they are explicitly invalidated. Zapping
> roots asynchronously was effectively a workaround to avoid stalling a vCPU
> for an extended duration if a vCPU unloaded a root, which at the time
> happened whenever the guest toggled CR0.WP (a frequent operation for some
> guest kernels).
>
> While a clever hack, zapping roots via an unbound worker had subtle,
> unintended consequences on host scheduling, especially when zapping
> multiple roots, e.g. as part of a memslot deletion. Because the work of
> zapping a root is no longer bound to the task that initiated the zap,
> things like the CPU affinity and priority of the original task get lost.
> Losing the affinity and priority can be especially problematic if unbound
> workqueues aren't affined to a small number of CPUs, as zapping multiple
> roots can cause KVM to heavily utilize the majority of CPUs in the system,
> *beyond* the CPUs KVM is already using to run vCPUs.
>
> When deleting a memslot via KVM_SET_USER_MEMORY_REGION, the async root
> zap can result in KVM occupying all logical CPUs for ~8ms, and result in
> high priority tasks not being scheduled in in a timely manner. In v5.15,
> which doesn't preserve unloaded roots, the issues were even more noticeable
> as KVM would zap roots more frequently and could occupy all CPUs for 50ms+.
>
> Consuming all CPUs for an extended duration can lead to significant jitter
> throughout the system, e.g. on ChromeOS with virtio-gpu, deleting memslots
> is a semi-frequent operation as memslots are deleted and recreated with
> different host virtual addresses to react to host GPU drivers allocating
> and freeing GPU blobs. On ChromeOS, the jitter manifests as audio blips
> during games due to the audio server's tasks not getting scheduled in
> promptly, despite the tasks having a high realtime priority.
>
> Deleting memslots isn't exactly a fast path and should be avoided when
> possible, and ChromeOS is working towards utilizing MAP_FIXED to avoid the
> memslot shenanigans, but KVM is squarely in the wrong. Not to mention
> that removing the async zapping eliminates a non-trivial amount of
> complexity.
>
> Note, one of the subtle behaviors hidden behind the async zapping is that
> KVM would zap invalidated roots only once (ignoring partial zaps from
> things like mmu_notifier events). Preserve this behavior by adding a flag
> to identify roots that are scheduled to be zapped versus roots that have
> already been zapped but not yet freed.
>
> Add a comment calling out why kvm_tdp_mmu_invalidate_all_roots() can
> encounter invalid roots, as it's not at all obvious why zapping
> invalidated roots shouldn't simply zap all invalid roots.
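As a quick aid for anyone reviewing the backport: the "zap exactly once"
behavior described above boils down to the pattern below. This is only a
toy userspace sketch, not kernel code; every toy_* name is made up, the two
bools merely mirror the patch's role.invalid and
tdp_mmu_scheduled_root_to_zap fields, and a pthread rwlock stands in for
mmu_lock (held for write to invalidate, for read to zap). Builds with
"cc -pthread toy.c":

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-ins for role.invalid and tdp_mmu_scheduled_root_to_zap. */
struct toy_root {
	bool invalid;
	bool scheduled_to_zap;
};

/* Stand-in for mmu_lock: write to invalidate roots, read to zap them. */
static pthread_rwlock_t toy_mmu_lock = PTHREAD_RWLOCK_INITIALIZER;

static void toy_invalidate_all_roots(struct toy_root *roots, int nr)
{
	pthread_rwlock_wrlock(&toy_mmu_lock);
	for (int i = 0; i < nr; i++) {
		/*
		 * An already-invalid root was zapped by a previous pass,
		 * e.g. an earlier memslot update; don't schedule it again.
		 */
		if (!roots[i].invalid) {
			roots[i].scheduled_to_zap = true;
			roots[i].invalid = true;
		}
	}
	pthread_rwlock_unlock(&toy_mmu_lock);
}

static void toy_zap_invalidated_roots(struct toy_root *roots, int nr)
{
	pthread_rwlock_rdlock(&toy_mmu_lock);
	for (int i = 0; i < nr; i++) {
		if (!roots[i].scheduled_to_zap)
			continue;
		/* Consume the flag so each root is zapped exactly once. */
		roots[i].scheduled_to_zap = false;
		printf("zapping root %d\n", i);
	}
	pthread_rwlock_unlock(&toy_mmu_lock);
}

int main(void)
{
	/* Root 0 is already invalid, i.e. was zapped by an earlier pass. */
	struct toy_root roots[2] = { { .invalid = true }, { 0 } };

	toy_invalidate_all_roots(roots, 2);
	toy_zap_invalidated_roots(roots, 2);	/* zaps only root 1 */
	toy_zap_invalidated_roots(roots, 2);	/* second pass is a no-op */
	return 0;
}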
>
> Reported-by: Pattara Teerapong <pteerapong@xxxxxxxxxx>
> Cc: David Stevens <stevensd@xxxxxxxxxx>
> Cc: Yiwei Zhang <zzyiwei@xxxxxxxxxx>
> Cc: Paul Hsia <paulhsia@xxxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> Message-Id: <20230916003916.2545000-4-seanjc@xxxxxxxxxx>
> Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> Cc: David Matlack <dmatlack@xxxxxxxxxx>
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
>
> Folks on Cc, it would be nice to get extra testing and/or reviews for this one
> before it's picked up for 6.1, there were quite a few conflicts to resolve.
> All of the conflicts were pretty straightforward, but I'd still appreciate an
> extra set of eyeballs or three.  Thanks!
>
>  arch/x86/include/asm/kvm_host.h |   3 +-
>  arch/x86/kvm/mmu/mmu.c          |   9 +--
>  arch/x86/kvm/mmu/mmu_internal.h |  15 ++--
>  arch/x86/kvm/mmu/tdp_mmu.c      | 135 +++++++++++++-------------------
>  arch/x86/kvm/mmu/tdp_mmu.h      |   4 +-
>  arch/x86/kvm/x86.c              |   5 +-
>  6 files changed, 69 insertions(+), 102 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 08a84f801bfe..c1dcaa3d2d6e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1324,7 +1324,6 @@ struct kvm_arch {
>  	 * the thread holds the MMU lock in write mode.
>  	 */
>  	spinlock_t tdp_mmu_pages_lock;
> -	struct workqueue_struct *tdp_mmu_zap_wq;
>  #endif /* CONFIG_X86_64 */
>  
>  	/*
> @@ -1727,7 +1726,7 @@ void kvm_mmu_vendor_module_exit(void);
>  
>  void kvm_mmu_destroy(struct kvm_vcpu *vcpu);
>  int kvm_mmu_create(struct kvm_vcpu *vcpu);
> -int kvm_mmu_init_vm(struct kvm *kvm);
> +void kvm_mmu_init_vm(struct kvm *kvm);
>  void kvm_mmu_uninit_vm(struct kvm *kvm);
>  
>  void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2a6fec4e2d19..d30325e297a0 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5994,19 +5994,16 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
>  	kvm_mmu_zap_all_fast(kvm);
>  }
>  
> -int kvm_mmu_init_vm(struct kvm *kvm)
> +void kvm_mmu_init_vm(struct kvm *kvm)
>  {
>  	struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
> -	int r;
>  
>  	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
>  	INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
>  	INIT_LIST_HEAD(&kvm->arch.lpage_disallowed_mmu_pages);
>  	spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
>  
> -	r = kvm_mmu_init_tdp_mmu(kvm);
> -	if (r < 0)
> -		return r;
> +	kvm_mmu_init_tdp_mmu(kvm);
>  
>  	node->track_write = kvm_mmu_pte_write;
>  	node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
> @@ -6019,8 +6016,6 @@ int kvm_mmu_init_vm(struct kvm *kvm)
>  
>  	kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
>  	kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
> -
> -	return 0;
>  }
>  
>  static void mmu_free_vm_memory_caches(struct kvm *kvm)
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 582def531d4d..0a9d5f2925c3 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -56,7 +56,12 @@ struct kvm_mmu_page {
>  
>  	bool tdp_mmu_page;
>  	bool unsync;
> -	u8 mmu_valid_gen;
> +	union {
> +		u8 mmu_valid_gen;
> +
> +		/* Only accessed under slots_lock. */
> +		bool tdp_mmu_scheduled_root_to_zap;
> +	};
>  	bool lpage_disallowed; /* Can't be replaced by an equiv large page */
>  
>  	/*
> @@ -92,13 +97,7 @@ struct kvm_mmu_page {
>  		struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
>  		tdp_ptep_t ptep;
>  	};
> -	union {
> -		DECLARE_BITMAP(unsync_child_bitmap, 512);
> -		struct {
> -			struct work_struct tdp_mmu_async_work;
> -			void *tdp_mmu_async_data;
> -		};
> -	};
> +	DECLARE_BITMAP(unsync_child_bitmap, 512);
>  
>  	struct list_head lpage_disallowed_link;
>  #ifdef CONFIG_X86_32
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 9b9fc4e834d0..c3b0f973375b 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -14,24 +14,16 @@ static bool __read_mostly tdp_mmu_enabled = true;
>  module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0644);
>  
>  /* Initializes the TDP MMU for the VM, if enabled. */
> -int kvm_mmu_init_tdp_mmu(struct kvm *kvm)
> +void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
>  {
> -	struct workqueue_struct *wq;
> -
>  	if (!tdp_enabled || !READ_ONCE(tdp_mmu_enabled))
> -		return 0;
> -
> -	wq = alloc_workqueue("kvm", WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE, 0);
> -	if (!wq)
> -		return -ENOMEM;
> +		return;
>  
>  	/* This should not be changed for the lifetime of the VM. */
>  	kvm->arch.tdp_mmu_enabled = true;
>  	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
>  	spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
>  	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages);
> -	kvm->arch.tdp_mmu_zap_wq = wq;
> -	return 1;
>  }
>  
>  /* Arbitrarily returns true so that this may be used in if statements. */
> @@ -57,20 +49,15 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
>  	 * ultimately frees all roots.
>  	 */
>  	kvm_tdp_mmu_invalidate_all_roots(kvm);
> -
> -	/*
> -	 * Destroying a workqueue also first flushes the workqueue, i.e. no
> -	 * need to invoke kvm_tdp_mmu_zap_invalidated_roots().
> -	 */
> -	destroy_workqueue(kvm->arch.tdp_mmu_zap_wq);
> +	kvm_tdp_mmu_zap_invalidated_roots(kvm);
>  
>  	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_pages));
>  	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
>  
>  	/*
>  	 * Ensure that all the outstanding RCU callbacks to free shadow pages
> -	 * can run before the VM is torn down.  Work items on tdp_mmu_zap_wq
> -	 * can call kvm_tdp_mmu_put_root and create new callbacks.
> +	 * can run before the VM is torn down.  Putting the last reference to
> +	 * zapped roots will create new callbacks.
>  	 */
>  	rcu_barrier();
>  }
> @@ -97,46 +84,6 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
>  	tdp_mmu_free_sp(sp);
>  }
>  
> -static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> -			     bool shared);
> -
> -static void tdp_mmu_zap_root_work(struct work_struct *work)
> -{
> -	struct kvm_mmu_page *root = container_of(work, struct kvm_mmu_page,
> -						 tdp_mmu_async_work);
> -	struct kvm *kvm = root->tdp_mmu_async_data;
> -
> -	read_lock(&kvm->mmu_lock);
> -
> -	/*
> -	 * A TLB flush is not necessary as KVM performs a local TLB flush when
> -	 * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
> -	 * to a different pCPU.  Note, the local TLB flush on reuse also
> -	 * invalidates any paging-structure-cache entries, i.e. TLB entries for
> -	 * intermediate paging structures, that may be zapped, as such entries
> -	 * are associated with the ASID on both VMX and SVM.
> -	 */
> -	tdp_mmu_zap_root(kvm, root, true);
> -
> -	/*
> -	 * Drop the refcount using kvm_tdp_mmu_put_root() to test its logic for
> -	 * avoiding an infinite loop.  By design, the root is reachable while
> -	 * it's being asynchronously zapped, thus a different task can put its
> -	 * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for an
> -	 * asynchronously zapped root is unavoidable.
> -	 */
> -	kvm_tdp_mmu_put_root(kvm, root, true);
> -
> -	read_unlock(&kvm->mmu_lock);
> -}
> -
> -static void tdp_mmu_schedule_zap_root(struct kvm *kvm, struct kvm_mmu_page *root)
> -{
> -	root->tdp_mmu_async_data = kvm;
> -	INIT_WORK(&root->tdp_mmu_async_work, tdp_mmu_zap_root_work);
> -	queue_work(kvm->arch.tdp_mmu_zap_wq, &root->tdp_mmu_async_work);
> -}
> -
>  void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  			  bool shared)
>  {
> @@ -222,11 +169,11 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
>  #define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared)	\
>  	__for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, true)
>  
> -#define for_each_tdp_mmu_root_yield_safe(_kvm, _root)			\
> -	for (_root = tdp_mmu_next_root(_kvm, NULL, false, false);	\
> +#define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _shared)			\
> +	for (_root = tdp_mmu_next_root(_kvm, NULL, _shared, false);		\
>  	     _root;							\
> -	     _root = tdp_mmu_next_root(_kvm, _root, false, false))	\
> -		if (!kvm_lockdep_assert_mmu_lock_held(_kvm, false)) {	\
> +	     _root = tdp_mmu_next_root(_kvm, _root, _shared, false))		\
> +		if (!kvm_lockdep_assert_mmu_lock_held(_kvm, _shared)) {	\
>  		} else
>  
>  /*
> @@ -305,7 +252,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
>  	 * by a memslot update or by the destruction of the VM.  Initialize the
>  	 * refcount to two; one reference for the vCPU, and one reference for
>  	 * the TDP MMU itself, which is held until the root is invalidated and
> -	 * is ultimately put by tdp_mmu_zap_root_work().
> +	 * is ultimately put by kvm_tdp_mmu_zap_invalidated_roots().
>  	 */
>  	refcount_set(&root->tdp_mmu_root_count, 2);
>  
> @@ -963,7 +910,7 @@ bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush)
>  {
>  	struct kvm_mmu_page *root;
>  
> -	for_each_tdp_mmu_root_yield_safe(kvm, root)
> +	for_each_tdp_mmu_root_yield_safe(kvm, root, false)
>  		flush = tdp_mmu_zap_leafs(kvm, root, start, end, true, flush);
>  
>  	return flush;
> @@ -985,7 +932,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
>  	 * is being destroyed or the userspace VMM has exited.  In both cases,
>  	 * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
>  	 */
> -	for_each_tdp_mmu_root_yield_safe(kvm, root)
> +	for_each_tdp_mmu_root_yield_safe(kvm, root, false)
>  		tdp_mmu_zap_root(kvm, root, false);
>  }
>  
> @@ -995,18 +942,47 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
>   */
>  void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
>  {
> -	flush_workqueue(kvm->arch.tdp_mmu_zap_wq);
> +	struct kvm_mmu_page *root;
> +
> +	read_lock(&kvm->mmu_lock);
> +
> +	for_each_tdp_mmu_root_yield_safe(kvm, root, true) {
> +		if (!root->tdp_mmu_scheduled_root_to_zap)
> +			continue;
> +
> +		root->tdp_mmu_scheduled_root_to_zap = false;
> +		KVM_BUG_ON(!root->role.invalid, kvm);
> +
> +		/*
> +		 * A TLB flush is not necessary as KVM performs a local TLB
> +		 * flush when allocating a new root (see kvm_mmu_load()), and
> +		 * when migrating a vCPU to a different pCPU.  Note, the local
> +		 * TLB flush on reuse also invalidates paging-structure-cache
> +		 * entries, i.e. TLB entries for intermediate paging structures,
> +		 * that may be zapped, as such entries are associated with the
> +		 * ASID on both VMX and SVM.
> +		 */
> +		tdp_mmu_zap_root(kvm, root, true);
> +
> +		/*
> +		 * The reference needs to be put *after* zapping the root, as
> +		 * the root must be reachable by mmu_notifiers while it's being
> +		 * zapped.
> +		 */
> +		kvm_tdp_mmu_put_root(kvm, root, true);
> +	}
> +
> +	read_unlock(&kvm->mmu_lock);
>  }
>  
>  /*
>   * Mark each TDP MMU root as invalid to prevent vCPUs from reusing a root that
>   * is about to be zapped, e.g. in response to a memslots update.  The actual
> - * zapping is performed asynchronously.  Using a separate workqueue makes it
> - * easy to ensure that the destruction is performed before the "fast zap"
> - * completes, without keeping a separate list of invalidated roots; the list is
> - * effectively the list of work items in the workqueue.
> + * zapping is done separately so that it happens with mmu_lock held for read,
> + * whereas invalidating roots must be done with mmu_lock held for write (unless
> + * the VM is being destroyed).
>   *
> - * Note, the asynchronous worker is gifted the TDP MMU's reference.
> + * Note, kvm_tdp_mmu_zap_invalidated_roots() is gifted the TDP MMU's reference.
>   * See kvm_tdp_mmu_get_vcpu_root_hpa().
>   */
>  void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
> @@ -1031,19 +1007,20 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
>  	/*
>  	 * As above, mmu_lock isn't held when destroying the VM!  There can't
>  	 * be other references to @kvm, i.e. nothing else can invalidate roots
> -	 * or be consuming roots, but walking the list of roots does need to be
> -	 * guarded against roots being deleted by the asynchronous zap worker.
> +	 * or get/put references to roots.
>  	 */
> -	rcu_read_lock();
> -
> -	list_for_each_entry_rcu(root, &kvm->arch.tdp_mmu_roots, link) {
> +	list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
> +		/*
> +		 * Note, invalid roots can outlive a memslot update!  Invalid
> +		 * roots must be *zapped* before the memslot update completes,
> +		 * but a different task can acquire a reference and keep the
> +		 * root alive after it's been zapped.
> +		 */
>  		if (!root->role.invalid) {
> +			root->tdp_mmu_scheduled_root_to_zap = true;
>  			root->role.invalid = true;
> -			tdp_mmu_schedule_zap_root(kvm, root);
>  		}
>  	}
> -
> -	rcu_read_unlock();
>  }
>  
>  /*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index d0a9fe0770fd..c82a8bb321bb 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -65,7 +65,7 @@ u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
>  			  u64 *spte);
>  
>  #ifdef CONFIG_X86_64
> -int kvm_mmu_init_tdp_mmu(struct kvm *kvm);
> +void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
>  void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
>  static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return sp->tdp_mmu_page; }
>  
> @@ -86,7 +86,7 @@ static inline bool is_tdp_mmu(struct kvm_mmu *mmu)
>  	return sp && is_tdp_mmu_page(sp) && sp->root_count;
>  }
>  #else
> -static inline int kvm_mmu_init_tdp_mmu(struct kvm *kvm) { return 0; }
> +static inline void kvm_mmu_init_tdp_mmu(struct kvm *kvm) {}
>  static inline void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm) {}
>  static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; }
>  static inline bool is_tdp_mmu(struct kvm_mmu *mmu) { return false; }
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1931d3fcbbe0..b929254c7876 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12442,9 +12442,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  	if (ret)
>  		goto out;
>  
> -	ret = kvm_mmu_init_vm(kvm);
> -	if (ret)
> -		goto out_page_track;
> +	kvm_mmu_init_vm(kvm);
>  
>  	ret = static_call(kvm_x86_vm_init)(kvm);
>  	if (ret)
> @@ -12489,7 +12487,6 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  
>  out_uninit_mmu:
>  	kvm_mmu_uninit_vm(kvm);
> -out_page_track:
>  	kvm_page_track_cleanup(kvm);
>  out:
>  	return ret;
>
> base-commit: adc4d740ad9ec780657327c69ab966fa4fdf0e8e
> --
> 2.42.0.655.g421f12c284-goog
>
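One more sketch for completeness, since the refcount handoff is the other
subtle piece: a root starts with two references (one for the vCPU, one for
the TDP MMU itself), kvm_tdp_mmu_zap_invalidated_roots() is gifted the
latter, and it puts that reference only *after* zapping, so the root stays
reachable (e.g. by mmu_notifiers) for the entire zap. Again a toy model,
not kernel code; the toy_* names are made up and a C11 atomic stands in
for refcount_t:

#include <stdatomic.h>
#include <stdio.h>

/* Toy stand-in for struct kvm_mmu_page's tdp_mmu_root_count. */
struct toy_root {
	atomic_int refcount;
};

static void toy_put_root(struct toy_root *root)
{
	/* Free the root only when the last reference is dropped. */
	if (atomic_fetch_sub(&root->refcount, 1) == 1)
		printf("last reference put, root freed\n");
}

int main(void)
{
	/* One reference for the vCPU, one gifted to the zap path. */
	struct toy_root root = { .refcount = 2 };

	toy_put_root(&root);	/* vCPU unloads the root */
	printf("zapping root, still reachable\n");
	toy_put_root(&root);	/* the zap path's put frees it */
	return 0;
}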