Gah, sorry Greg, forgot to say that this is for 6.1-stable.

On Thu, Oct 19, 2023, Sean Christopherson wrote:
> [ Upstream commit 0df9dab891ff0d9b646d82e4fe038229e4c02451 ]
>
> Stop zapping invalidated TDP MMU roots via work queue now that KVM
> preserves TDP MMU roots until they are explicitly invalidated. Zapping
> roots asynchronously was effectively a workaround to avoid stalling a vCPU
> for an extended duration if a vCPU unloaded a root, which at the time
> happened whenever the guest toggled CR0.WP (a frequent operation for some
> guest kernels).
>
> While a clever hack, zapping roots via an unbound worker had subtle,
> unintended consequences on host scheduling, especially when zapping
> multiple roots, e.g. as part of a memslot deletion. Because the work of
> zapping a root is no longer bound to the task that initiated the zap,
> things like the CPU affinity and priority of the original task get lost.
> Losing the affinity and priority can be especially problematic if unbound
> workqueues aren't affined to a small number of CPUs, as zapping multiple
> roots can cause KVM to heavily utilize the majority of CPUs in the system,
> *beyond* the CPUs KVM is already using to run vCPUs.
>
> When deleting a memslot via KVM_SET_USER_MEMORY_REGION, the async root
> zap can result in KVM occupying all logical CPUs for ~8ms, and result in
> high priority tasks not being scheduled in in a timely manner. In v5.15,
> which doesn't preserve unloaded roots, the issues were even more noticeable
> as KVM would zap roots more frequently and could occupy all CPUs for 50ms+.
>
> Consuming all CPUs for an extended duration can lead to significant jitter
> throughout the system, e.g. on ChromeOS with virtio-gpu, deleting memslots
> is a semi-frequent operation as memslots are deleted and recreated with
> different host virtual addresses to react to host GPU drivers allocating
> and freeing GPU blobs. On ChromeOS, the jitter manifests as audio blips
> during games due to the audio server's tasks not getting scheduled in
> promptly, despite the tasks having a high realtime priority.
>
> Deleting memslots isn't exactly a fast path and should be avoided when
> possible, and ChromeOS is working towards utilizing MAP_FIXED to avoid the
> memslot shenanigans, but KVM is squarely in the wrong. Not to mention
> that removing the async zapping eliminates a non-trivial amount of
> complexity.
>
> Note, one of the subtle behaviors hidden behind the async zapping is that
> KVM would zap invalidated roots only once (ignoring partial zaps from
> things like mmu_notifier events). Preserve this behavior by adding a flag
> to identify roots that are scheduled to be zapped versus roots that have
> already been zapped but not yet freed.
>
> Add a comment calling out why kvm_tdp_mmu_invalidate_all_roots() can
> encounter invalid roots, as it's not at all obvious why zapping
> invalidated roots shouldn't simply zap all invalid roots.
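As a quick aid for anyone reviewing the backport: the "zap exactly once"
behavior described above boils down to the pattern below. This is only a
toy userspace sketch, not kernel code; every toy_* name is made up, the two
bools merely mirror the patch's role.invalid and
tdp_mmu_scheduled_root_to_zap fields, and a pthread rwlock stands in for
mmu_lock (held for write to invalidate, for read to zap). Builds with
"cc -pthread toy.c":

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-ins for role.invalid and tdp_mmu_scheduled_root_to_zap. */
struct toy_root {
	bool invalid;
	bool scheduled_to_zap;
};

/* Stand-in for mmu_lock: write to invalidate roots, read to zap them. */
static pthread_rwlock_t toy_mmu_lock = PTHREAD_RWLOCK_INITIALIZER;

static void toy_invalidate_all_roots(struct toy_root *roots, int nr)
{
	pthread_rwlock_wrlock(&toy_mmu_lock);
	for (int i = 0; i < nr; i++) {
		/*
		 * An already-invalid root was zapped by a previous pass,
		 * e.g. an earlier memslot update; don't schedule it again.
		 */
		if (!roots[i].invalid) {
			roots[i].scheduled_to_zap = true;
			roots[i].invalid = true;
		}
	}
	pthread_rwlock_unlock(&toy_mmu_lock);
}

static void toy_zap_invalidated_roots(struct toy_root *roots, int nr)
{
	pthread_rwlock_rdlock(&toy_mmu_lock);
	for (int i = 0; i < nr; i++) {
		if (!roots[i].scheduled_to_zap)
			continue;
		/* Consume the flag so each root is zapped exactly once. */
		roots[i].scheduled_to_zap = false;
		printf("zapping root %d\n", i);
	}
	pthread_rwlock_unlock(&toy_mmu_lock);
}

int main(void)
{
	/* Root 0 is already invalid, i.e. was zapped by an earlier pass. */
	struct toy_root roots[2] = { { .invalid = true }, { 0 } };

	toy_invalidate_all_roots(roots, 2);
	toy_zap_invalidated_roots(roots, 2);	/* zaps only root 1 */
	toy_zap_invalidated_roots(roots, 2);	/* second pass is a no-op */
	return 0;
}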
>
> Reported-by: Pattara Teerapong <pteerapong@xxxxxxxxxx>
> Cc: David Stevens <stevensd@xxxxxxxxxx>
> Cc: Yiwei Zhang <zzyiwei@xxxxxxxxxx>
> Cc: Paul Hsia <paulhsia@xxxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> Message-Id: <20230916003916.2545000-4-seanjc@xxxxxxxxxx>
> Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> Cc: David Matlack <dmatlack@xxxxxxxxxx>
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
>
> Folks on Cc, it would be nice to get extra testing and/or reviews for this one
> before it's picked up for 6.1, there were quite a few conflicts to resolve.
> All of the conflicts were pretty straightforward, but I'd still appreciate an
> extra set of eyeballs or three.  Thanks!
>
>  arch/x86/include/asm/kvm_host.h |   3 +-
>  arch/x86/kvm/mmu/mmu.c          |   9 +--
>  arch/x86/kvm/mmu/mmu_internal.h |  15 ++--
>  arch/x86/kvm/mmu/tdp_mmu.c      | 135 +++++++++++++-------------------
>  arch/x86/kvm/mmu/tdp_mmu.h      |   4 +-
>  arch/x86/kvm/x86.c              |   5 +-
>  6 files changed, 69 insertions(+), 102 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 08a84f801bfe..c1dcaa3d2d6e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1324,7 +1324,6 @@ struct kvm_arch {
>  	 * the thread holds the MMU lock in write mode.
>  	 */
>  	spinlock_t tdp_mmu_pages_lock;
> -	struct workqueue_struct *tdp_mmu_zap_wq;
>  #endif /* CONFIG_X86_64 */
>  
>  	/*
> @@ -1727,7 +1726,7 @@ void kvm_mmu_vendor_module_exit(void);
>  
>  void kvm_mmu_destroy(struct kvm_vcpu *vcpu);
>  int kvm_mmu_create(struct kvm_vcpu *vcpu);
> -int kvm_mmu_init_vm(struct kvm *kvm);
> +void kvm_mmu_init_vm(struct kvm *kvm);
>  void kvm_mmu_uninit_vm(struct kvm *kvm);
>  
>  void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2a6fec4e2d19..d30325e297a0 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5994,19 +5994,16 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
>  	kvm_mmu_zap_all_fast(kvm);
>  }
>  
> -int kvm_mmu_init_vm(struct kvm *kvm)
> +void kvm_mmu_init_vm(struct kvm *kvm)
>  {
>  	struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
> -	int r;
>  
>  	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
>  	INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
>  	INIT_LIST_HEAD(&kvm->arch.lpage_disallowed_mmu_pages);
>  	spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
>  
> -	r = kvm_mmu_init_tdp_mmu(kvm);
> -	if (r < 0)
> -		return r;
> +	kvm_mmu_init_tdp_mmu(kvm);
>  
>  	node->track_write = kvm_mmu_pte_write;
>  	node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
> @@ -6019,8 +6016,6 @@ int kvm_mmu_init_vm(struct kvm *kvm)
>  
>  	kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
>  	kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
> -
> -	return 0;
>  }
>  
>  static void mmu_free_vm_memory_caches(struct kvm *kvm)
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 582def531d4d..0a9d5f2925c3 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -56,7 +56,12 @@ struct kvm_mmu_page {
>  
>  	bool tdp_mmu_page;
>  	bool unsync;
> -	u8 mmu_valid_gen;
> +	union {
> +		u8 mmu_valid_gen;
> +
> +		/* Only accessed under slots_lock. */
> +		bool tdp_mmu_scheduled_root_to_zap;
> +	};
>  	bool lpage_disallowed; /* Can't be replaced by an equiv large page */
>  
>  	/*
> @@ -92,13 +97,7 @@ struct kvm_mmu_page {
>  		struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
>  		tdp_ptep_t ptep;
>  	};
> -	union {
> -		DECLARE_BITMAP(unsync_child_bitmap, 512);
> -		struct {
> -			struct work_struct tdp_mmu_async_work;
> -			void *tdp_mmu_async_data;
> -		};
> -	};
> +	DECLARE_BITMAP(unsync_child_bitmap, 512);
>  
>  	struct list_head lpage_disallowed_link;
>  #ifdef CONFIG_X86_32
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 9b9fc4e834d0..c3b0f973375b 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -14,24 +14,16 @@ static bool __read_mostly tdp_mmu_enabled = true;
>  module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0644);
>  
>  /* Initializes the TDP MMU for the VM, if enabled. */
> -int kvm_mmu_init_tdp_mmu(struct kvm *kvm)
> +void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
>  {
> -	struct workqueue_struct *wq;
> -
>  	if (!tdp_enabled || !READ_ONCE(tdp_mmu_enabled))
> -		return 0;
> -
> -	wq = alloc_workqueue("kvm", WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE, 0);
> -	if (!wq)
> -		return -ENOMEM;
> +		return;
>  
>  	/* This should not be changed for the lifetime of the VM. */
>  	kvm->arch.tdp_mmu_enabled = true;
>  	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
>  	spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
>  	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages);
> -	kvm->arch.tdp_mmu_zap_wq = wq;
> -	return 1;
>  }
>  
>  /* Arbitrarily returns true so that this may be used in if statements. */
> @@ -57,20 +49,15 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
>  	 * ultimately frees all roots.
>  	 */
>  	kvm_tdp_mmu_invalidate_all_roots(kvm);
> -
> -	/*
> -	 * Destroying a workqueue also first flushes the workqueue, i.e. no
> -	 * need to invoke kvm_tdp_mmu_zap_invalidated_roots().
> -	 */
> -	destroy_workqueue(kvm->arch.tdp_mmu_zap_wq);
> +	kvm_tdp_mmu_zap_invalidated_roots(kvm);
>  
>  	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_pages));
>  	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
>  
>  	/*
>  	 * Ensure that all the outstanding RCU callbacks to free shadow pages
> -	 * can run before the VM is torn down.  Work items on tdp_mmu_zap_wq
> -	 * can call kvm_tdp_mmu_put_root and create new callbacks.
> +	 * can run before the VM is torn down.  Putting the last reference to
> +	 * zapped roots will create new callbacks.
>  	 */
>  	rcu_barrier();
>  }
> @@ -97,46 +84,6 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
>  	tdp_mmu_free_sp(sp);
>  }
>  
> -static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> -			     bool shared);
> -
> -static void tdp_mmu_zap_root_work(struct work_struct *work)
> -{
> -	struct kvm_mmu_page *root = container_of(work, struct kvm_mmu_page,
> -						 tdp_mmu_async_work);
> -	struct kvm *kvm = root->tdp_mmu_async_data;
> -
> -	read_lock(&kvm->mmu_lock);
> -
> -	/*
> -	 * A TLB flush is not necessary as KVM performs a local TLB flush when
> -	 * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
> -	 * to a different pCPU.  Note, the local TLB flush on reuse also
> -	 * invalidates any paging-structure-cache entries, i.e. TLB entries for
> -	 * intermediate paging structures, that may be zapped, as such entries
> -	 * are associated with the ASID on both VMX and SVM.
> -	 */
> -	tdp_mmu_zap_root(kvm, root, true);
> -
> -	/*
> -	 * Drop the refcount using kvm_tdp_mmu_put_root() to test its logic for
> -	 * avoiding an infinite loop.  By design, the root is reachable while
> -	 * it's being asynchronously zapped, thus a different task can put its
> -	 * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for an
> -	 * asynchronously zapped root is unavoidable.
> -	 */
> -	kvm_tdp_mmu_put_root(kvm, root, true);
> -
> -	read_unlock(&kvm->mmu_lock);
> -}
> -
> -static void tdp_mmu_schedule_zap_root(struct kvm *kvm, struct kvm_mmu_page *root)
> -{
> -	root->tdp_mmu_async_data = kvm;
> -	INIT_WORK(&root->tdp_mmu_async_work, tdp_mmu_zap_root_work);
> -	queue_work(kvm->arch.tdp_mmu_zap_wq, &root->tdp_mmu_async_work);
> -}
> -
>  void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  			  bool shared)
>  {
> @@ -222,11 +169,11 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
>  #define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared)	\
>  	__for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, true)
>  
> -#define for_each_tdp_mmu_root_yield_safe(_kvm, _root)			\
> -	for (_root = tdp_mmu_next_root(_kvm, NULL, false, false);	\
> +#define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _shared)			\
> +	for (_root = tdp_mmu_next_root(_kvm, NULL, _shared, false);		\
>  	     _root;							\
> -	     _root = tdp_mmu_next_root(_kvm, _root, false, false))	\
> -		if (!kvm_lockdep_assert_mmu_lock_held(_kvm, false)) {	\
> +	     _root = tdp_mmu_next_root(_kvm, _root, _shared, false))		\
> +		if (!kvm_lockdep_assert_mmu_lock_held(_kvm, _shared)) {	\
>  		} else
>  
>  /*
> @@ -305,7 +252,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
>  	 * by a memslot update or by the destruction of the VM.  Initialize the
>  	 * refcount to two; one reference for the vCPU, and one reference for
>  	 * the TDP MMU itself, which is held until the root is invalidated and
> -	 * is ultimately put by tdp_mmu_zap_root_work().
> +	 * is ultimately put by kvm_tdp_mmu_zap_invalidated_roots().
>  	 */
>  	refcount_set(&root->tdp_mmu_root_count, 2);
>  
> @@ -963,7 +910,7 @@ bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush)
>  {
>  	struct kvm_mmu_page *root;
>  
> -	for_each_tdp_mmu_root_yield_safe(kvm, root)
> +	for_each_tdp_mmu_root_yield_safe(kvm, root, false)
>  		flush = tdp_mmu_zap_leafs(kvm, root, start, end, true, flush);
>  
>  	return flush;
> @@ -985,7 +932,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
>  	 * is being destroyed or the userspace VMM has exited.  In both cases,
>  	 * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
>  	 */
> -	for_each_tdp_mmu_root_yield_safe(kvm, root)
> +	for_each_tdp_mmu_root_yield_safe(kvm, root, false)
>  		tdp_mmu_zap_root(kvm, root, false);
>  }
>  
> @@ -995,18 +942,47 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
>   */
>  void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
>  {
> -	flush_workqueue(kvm->arch.tdp_mmu_zap_wq);
> +	struct kvm_mmu_page *root;
> +
> +	read_lock(&kvm->mmu_lock);
> +
> +	for_each_tdp_mmu_root_yield_safe(kvm, root, true) {
> +		if (!root->tdp_mmu_scheduled_root_to_zap)
> +			continue;
> +
> +		root->tdp_mmu_scheduled_root_to_zap = false;
> +		KVM_BUG_ON(!root->role.invalid, kvm);
> +
> +		/*
> +		 * A TLB flush is not necessary as KVM performs a local TLB
> +		 * flush when allocating a new root (see kvm_mmu_load()), and
> +		 * when migrating a vCPU to a different pCPU.  Note, the local
> +		 * TLB flush on reuse also invalidates paging-structure-cache
> +		 * entries, i.e. TLB entries for intermediate paging structures,
> +		 * that may be zapped, as such entries are associated with the
> +		 * ASID on both VMX and SVM.
> +		 */
> +		tdp_mmu_zap_root(kvm, root, true);
> +
> +		/*
> +		 * The reference needs to be put *after* zapping the root, as
> +		 * the root must be reachable by mmu_notifiers while it's being
> +		 * zapped.
> +		 */
> +		kvm_tdp_mmu_put_root(kvm, root, true);
> +	}
> +
> +	read_unlock(&kvm->mmu_lock);
>  }
>  
>  /*
>   * Mark each TDP MMU root as invalid to prevent vCPUs from reusing a root that
>   * is about to be zapped, e.g. in response to a memslots update.  The actual
> - * zapping is performed asynchronously.  Using a separate workqueue makes it
> - * easy to ensure that the destruction is performed before the "fast zap"
> - * completes, without keeping a separate list of invalidated roots; the list is
> - * effectively the list of work items in the workqueue.
> + * zapping is done separately so that it happens with mmu_lock held for read,
> + * whereas invalidating roots must be done with mmu_lock held for write (unless
> + * the VM is being destroyed).
>   *
> - * Note, the asynchronous worker is gifted the TDP MMU's reference.
> + * Note, kvm_tdp_mmu_zap_invalidated_roots() is gifted the TDP MMU's reference.
>   * See kvm_tdp_mmu_get_vcpu_root_hpa().
>   */
>  void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
> @@ -1031,19 +1007,20 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
>  	/*
>  	 * As above, mmu_lock isn't held when destroying the VM!  There can't
>  	 * be other references to @kvm, i.e. nothing else can invalidate roots
> -	 * or be consuming roots, but walking the list of roots does need to be
> -	 * guarded against roots being deleted by the asynchronous zap worker.
> +	 * or get/put references to roots.
>  	 */
> -	rcu_read_lock();
> -
> -	list_for_each_entry_rcu(root, &kvm->arch.tdp_mmu_roots, link) {
> +	list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
> +		/*
> +		 * Note, invalid roots can outlive a memslot update!  Invalid
> +		 * roots must be *zapped* before the memslot update completes,
> +		 * but a different task can acquire a reference and keep the
> +		 * root alive after it's been zapped.
> +		 */
>  		if (!root->role.invalid) {
> +			root->tdp_mmu_scheduled_root_to_zap = true;
>  			root->role.invalid = true;
> -			tdp_mmu_schedule_zap_root(kvm, root);
>  		}
>  	}
> -
> -	rcu_read_unlock();
>  }
>  
>  /*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index d0a9fe0770fd..c82a8bb321bb 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -65,7 +65,7 @@ u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
>  			  u64 *spte);
>  
>  #ifdef CONFIG_X86_64
> -int kvm_mmu_init_tdp_mmu(struct kvm *kvm);
> +void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
>  void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
>  static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return sp->tdp_mmu_page; }
>  
> @@ -86,7 +86,7 @@ static inline bool is_tdp_mmu(struct kvm_mmu *mmu)
>  	return sp && is_tdp_mmu_page(sp) && sp->root_count;
>  }
>  #else
> -static inline int kvm_mmu_init_tdp_mmu(struct kvm *kvm) { return 0; }
> +static inline void kvm_mmu_init_tdp_mmu(struct kvm *kvm) {}
>  static inline void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm) {}
>  static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; }
>  static inline bool is_tdp_mmu(struct kvm_mmu *mmu) { return false; }
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1931d3fcbbe0..b929254c7876 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12442,9 +12442,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  	if (ret)
>  		goto out;
>  
> -	ret = kvm_mmu_init_vm(kvm);
> -	if (ret)
> -		goto out_page_track;
> +	kvm_mmu_init_vm(kvm);
>  
>  	ret = static_call(kvm_x86_vm_init)(kvm);
>  	if (ret)
> @@ -12489,7 +12487,6 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  
>  out_uninit_mmu:
>  	kvm_mmu_uninit_vm(kvm);
> -out_page_track:
>  	kvm_page_track_cleanup(kvm);
>  out:
>  	return ret;
>
> base-commit: adc4d740ad9ec780657327c69ab966fa4fdf0e8e
> --
> 2.42.0.655.g421f12c284-goog
>
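One more sketch for completeness, since the refcount handoff is the other
subtle piece: a root starts with two references (one for the vCPU, one for
the TDP MMU itself), kvm_tdp_mmu_zap_invalidated_roots() is gifted the
latter, and it puts that reference only *after* zapping, so the root stays
reachable (e.g. by mmu_notifiers) for the entire zap. Again a toy model,
not kernel code; the toy_* names are made up and a C11 atomic stands in
for refcount_t:

#include <stdatomic.h>
#include <stdio.h>

/* Toy stand-in for struct kvm_mmu_page's tdp_mmu_root_count. */
struct toy_root {
	atomic_int refcount;
};

static void toy_put_root(struct toy_root *root)
{
	/* Free the root only when the last reference is dropped. */
	if (atomic_fetch_sub(&root->refcount, 1) == 1)
		printf("last reference put, root freed\n");
}

int main(void)
{
	/* One reference for the vCPU, one gifted to the zap path. */
	struct toy_root root = { .refcount = 2 };

	toy_put_root(&root);	/* vCPU unloads the root */
	printf("zapping root, still reachable\n");
	toy_put_root(&root);	/* the zap path's put frees it */
	return 0;
}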