On Fri, Mar 04, 2022, Paolo Bonzini wrote:
> On 3/4/22 17:02, Sean Christopherson wrote:
> > On Fri, Mar 04, 2022, Paolo Bonzini wrote:
> > > On 3/3/22 22:32, Sean Christopherson wrote:
> > > I didn't remove the paragraph from the commit message, but I think it's
> > > unnecessary now. The workqueue is flushed in kvm_mmu_zap_all_fast() and
> > > kvm_mmu_uninit_tdp_mmu(), unlike the buggy patch, so it doesn't need to take
> > > a reference to the VM.
> > >
> > > I think I don't even need to check kvm->users_count in the defunct root
> > > case, as long as kvm_mmu_uninit_tdp_mmu() flushes and destroys the workqueue
> > > before it checks that the lists are empty.
> >
> > Yes, that should work. IIRC, the WARN_ONs will tell us/you quite quickly if
> > we're wrong :-) mmu_notifier_unregister() will call the "slow" kvm_mmu_zap_all()
> > and thus ensure all non-root pages are zapped, but "leaking" a worker will trigger
> > the WARN_ON that there are no roots on the list.
>
> Good, for the record these are the commit messages I have:
>
> KVM: x86/mmu: Zap invalidated roots via asynchronous worker
>
> Use the system worker threads to zap the roots invalidated
> by the TDP MMU's "fast zap" mechanism, implemented by
> kvm_tdp_mmu_invalidate_all_roots().
>
> At this point, apart from allowing some parallelism in the zapping of
> roots, the workqueue is a glorified linked list: work items are added and
> flushed entirely within a single kvm->slots_lock critical section. However,
> the workqueue fixes a latent issue where kvm_mmu_zap_all_invalidated_roots()
> assumes that it owns a reference to all invalid roots; therefore, no
> one can set the invalid bit outside kvm_mmu_zap_all_fast(). Putting the
> invalidated roots on a linked list... erm, on a workqueue ensures that
> tdp_mmu_zap_root_work() only puts back those extra references that
> kvm_mmu_zap_all_invalidated_roots() had gifted to it.
>
> and
>
> KVM: x86/mmu: Zap defunct roots via asynchronous worker
>
> Zap defunct roots, a.k.a. roots that have been invalidated after their
> last reference was initially dropped, asynchronously via the existing work
> queue instead of forcing the work upon the unfortunate task that happened
> to drop the last reference.
>
> If a vCPU task drops the last reference, the vCPU is effectively blocked
> by the host for the entire duration of the zap. If the root being zapped
> happens to be fully populated with 4KiB leaf SPTEs, e.g. due to dirty logging
> being active, the zap can take several hundred seconds. Unsurprisingly,
> most guests are unhappy if a vCPU disappears for hundreds of seconds.
>
> E.g. running a synthetic selftest that triggers a vCPU root zap with
> ~64TiB of guest memory and 4KiB SPTEs blocks the vCPU for 900+ seconds.
> Offloading the zap to a worker drops the block time to <100ms.
>
> There is an important nuance to this change. If the same work item
> was queued twice before the work function has run, it would only
> execute once and one reference would be leaked. Therefore, now that
> queueing work items is no longer protected by write_lock(&kvm->mmu_lock),
> kvm_tdp_mmu_invalidate_all_roots() has to check root->role.invalid and
> skip already invalid roots. On the other hand, kvm_mmu_zap_all_fast()
> must return only after those skipped roots have been zapped as well.
> These two requirements can be satisfied only if _all_ places that
> change invalid to true now schedule the worker before releasing the
> mmu_lock. There are just two, kvm_tdp_mmu_put_root() and
> kvm_tdp_mmu_invalidate_all_roots().

Very nice!
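
For readers following the thread, here is a rough sketch of the shape of the mechanism the first commit message describes: kvm_tdp_mmu_invalidate_all_roots() marks each root invalid and gifts a reference to a per-root work item, tdp_mmu_zap_root_work() zaps the root and puts that reference back, and kvm_mmu_zap_all_fast() flushes the workqueue while still holding kvm->slots_lock. This is illustrative pseudo-kernel code only, not the actual patches under discussion; the tdp_mmu_zap_wq, zap_work, zap_kvm, root_refcount, and tdp_mmu_schedule_zap_root names are placeholders, and locking, error handling, and the real zapping logic are heavily simplified.

    /* Illustrative sketch only -- not the actual KVM code. */

    static void tdp_mmu_zap_root_work(struct work_struct *work)
    {
            struct kvm_mmu_page *root = container_of(work, struct kvm_mmu_page,
                                                     zap_work);
            struct kvm *kvm = root->zap_kvm;        /* placeholder back-pointer */

            read_lock(&kvm->mmu_lock);

            /* ... walk the root and zap all of its SPTEs (details omitted) ... */

            /* Put back the reference gifted to the worker when it was queued. */
            kvm_tdp_mmu_put_root(kvm, root, true);

            read_unlock(&kvm->mmu_lock);
    }

    static void tdp_mmu_schedule_zap_root(struct kvm *kvm, struct kvm_mmu_page *root)
    {
            root->zap_kvm = kvm;
            INIT_WORK(&root->zap_work, tdp_mmu_zap_root_work);
            queue_work(kvm->arch.tdp_mmu_zap_wq, &root->zap_work);
    }

    void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
    {
            struct kvm_mmu_page *root;

            lockdep_assert_held_write(&kvm->mmu_lock);

            list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
                    /*
                     * Skip roots that are already invalid: the other path that
                     * sets role.invalid (kvm_tdp_mmu_put_root(), sketched below)
                     * has already queued a worker for them, and queueing the
                     * same work item twice would make it run only once and
                     * leak a reference.
                     */
                    if (root->role.invalid)
                            continue;

                    /* Take a reference that is "gifted" to the zap worker. */
                    if (WARN_ON_ONCE(!refcount_inc_not_zero(&root->root_refcount)))
                            continue;

                    root->role.invalid = true;
                    tdp_mmu_schedule_zap_root(kvm, root);
            }
    }

    static void kvm_mmu_zap_all_fast(struct kvm *kvm)
    {
            /* The caller holds kvm->slots_lock for this entire sequence. */
            write_lock(&kvm->mmu_lock);
            kvm_tdp_mmu_invalidate_all_roots(kvm);
            write_unlock(&kvm->mmu_lock);

            /*
             * Wait until every invalidated root, including any that were
             * skipped above because they were already invalid, has actually
             * been zapped.
             */
            flush_workqueue(kvm->arch.tdp_mmu_zap_wq);
    }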
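
Continuing the same illustrative sketch for the second commit message and for the teardown ordering discussed at the top of the thread: kvm_tdp_mmu_put_root() marks the root invalid and queues the zap worker before mmu_lock is released when it drops the last reference on a still-valid root, and kvm_mmu_uninit_tdp_mmu() flushes and destroys the workqueue before the list-emptiness WARN_ONs. Same caveats as above: placeholder field names, simplified locking, and the unlink/free path is omitted.

    /* Illustrative sketch only -- continues the sketch above. */

    void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
                              bool shared)
    {
            /* mmu_lock is held (shared or exclusive) by the caller. */
            if (!refcount_dec_and_test(&root->root_refcount))
                    return;

            if (!root->role.invalid) {
                    /*
                     * Last reference dropped on a live root: revive the
                     * refcount for the zap worker, mark the root invalid, and
                     * queue the worker *before* mmu_lock is released, so that
                     * a concurrent kvm_tdp_mmu_invalidate_all_roots() sees
                     * role.invalid and does not queue the same work item a
                     * second time.
                     */
                    refcount_set(&root->root_refcount, 1);
                    root->role.invalid = true;
                    tdp_mmu_schedule_zap_root(kvm, root);
                    return;
            }

            /*
             * The root is invalid and its zap worker has already run: unlink
             * it from tdp_mmu_roots and free it (details omitted).
             */
    }

    void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
    {
            /*
             * Flush and destroy the workqueue first: any worker still running
             * holds a root reference, so the lists below cannot be empty
             * until all workers have finished.
             */
            flush_workqueue(kvm->arch.tdp_mmu_zap_wq);
            destroy_workqueue(kvm->arch.tdp_mmu_zap_wq);

            WARN_ON(!list_empty(&kvm->arch.tdp_mmu_pages));
            WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
    }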