Reviewed-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>

On Wed, Feb 21, 2024 at 05:26:40PM -0800, Sean Christopherson wrote:
> Retry page faults without acquiring mmu_lock, and without even faulting
> the page into the primary MMU, if the resolved gfn is covered by an active
> invalidation.  Contending for mmu_lock is especially problematic on
> preemptible kernels as the mmu_notifier invalidation task will yield
> mmu_lock (see rwlock_needbreak()), delay the in-progress invalidation, and
> ultimately increase the latency of resolving the page fault.  And in the
> worst case scenario, yielding will be accompanied by a remote TLB flush,
> e.g. if the invalidation covers a large range of memory and vCPUs are
> accessing addresses that were already zapped.
>
> Faulting the page into the primary MMU is similarly problematic, as doing
> so may acquire locks that need to be taken for the invalidation to
> complete (the primary MMU has finer grained locks than KVM's MMU), and/or
> may cause unnecessary churn (getting/putting pages, marking them accessed,
> etc).
>
> Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
> iterators to perform more work before yielding, but that wouldn't solve
> the lock contention and would negatively affect scenarios where a vCPU is
> trying to fault in an address that is NOT covered by the in-progress
> invalidation.
>
> Add a dedicated lockless version of the range-based retry check to avoid
> false positives on the sanity check on start+end WARN, and so that it's
> super obvious that checking for a racing invalidation without holding
> mmu_lock is unsafe (though obviously useful).
>
> Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that pre-checking
> invalidation in a loop won't put KVM into an infinite loop, e.g. due to
> caching the in-progress flag and never seeing it go to '0'.
>
> Force a load of mmu_invalidate_seq as well, even though it isn't strictly
> necessary to avoid an infinite loop, as doing so improves the probability
> that KVM will detect an invalidation that already completed before
> acquiring mmu_lock and bailing anyways.
>
> Do the pre-check even for non-preemptible kernels, as waiting to detect
> the invalidation until mmu_lock is held guarantees the vCPU will observe
> the worst case latency in terms of handling the fault, and can generate
> even more mmu_lock contention.  E.g. the vCPU will acquire mmu_lock,
> detect retry, drop mmu_lock, re-enter the guest, retake the fault, and
> eventually re-acquire mmu_lock.  This behavior is also why there are no
> new starvation issues due to losing the fairness guarantees provided by
> rwlocks: if the vCPU needs to retry, it _must_ drop mmu_lock, i.e. waiting
> on mmu_lock doesn't guarantee forward progress in the face of _another_
> mmu_notifier invalidation event.
>
> Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE()
> may generate a load into a register instead of doing a direct comparison
> (MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost
> is a few bytes of code and maaaaybe a cycle or three.
>
> Reported-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>
> Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@xxxxxxxxxxxxxxxxxxxxxxxxx
> Reported-by: Friedrich Weber <f.weber@xxxxxxxxxxx>
> Cc: Kai Huang <kai.huang@xxxxxxxxx>
> Cc: Yan Zhao <yan.y.zhao@xxxxxxxxx>
> Cc: Yuan Yao <yuan.yao@xxxxxxxxxxxxxxx>
> Cc: Xu Yilun <yilun.xu@xxxxxxxxxxxxxxx>
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
>
> v5:
>  - Fix the inverted slot check. [Xu]
>  - Drop all the other patches (will post separately).
>
>  arch/x86/kvm/mmu/mmu.c   | 42 ++++++++++++++++++++++++++++++++++++++++
>  include/linux/kvm_host.h | 26 +++++++++++++++++++++++++
>  2 files changed, 68 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3c193b096b45..274acc53f0e9 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4405,6 +4405,31 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
>  	smp_rmb();
>
> +	/*
> +	 * Check for a relevant mmu_notifier invalidation event before getting
> +	 * the pfn from the primary MMU, and before acquiring mmu_lock.
> +	 *
> +	 * For mmu_lock, if there is an in-progress invalidation and the kernel
> +	 * allows preemption, the invalidation task may drop mmu_lock and yield
> +	 * in response to mmu_lock being contended, which is *very* counter-
> +	 * productive as this vCPU can't actually make forward progress until
> +	 * the invalidation completes.
> +	 *
> +	 * Retrying now can also avoid unnecessary lock contention in the
> +	 * primary MMU, as the primary MMU doesn't necessarily hold a single
> +	 * lock for the duration of the invalidation, i.e. faulting in a
> +	 * conflicting pfn can cause the invalidation to take longer by holding
> +	 * locks that are needed to complete the invalidation.
> +	 *
> +	 * Do the pre-check even for non-preemptible kernels, i.e. even if KVM
> +	 * will never yield mmu_lock in response to contention, as this vCPU is
> +	 * *guaranteed* to need to retry, i.e. waiting until mmu_lock is held
> +	 * to detect retry guarantees the worst case latency for the vCPU.
> +	 */
> +	if (fault->slot &&
> +	    mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
> +		return RET_PF_RETRY;
> +
>  	ret = __kvm_faultin_pfn(vcpu, fault);
>  	if (ret != RET_PF_CONTINUE)
>  		return ret;
> @@ -4415,6 +4440,18 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  	if (unlikely(!fault->slot))
>  		return kvm_handle_noslot_fault(vcpu, fault, access);
>
> +	/*
> +	 * Check again for a relevant mmu_notifier invalidation event purely to
> +	 * avoid contending mmu_lock.  Most invalidations will be detected by
> +	 * the previous check, but checking is extremely cheap relative to the
> +	 * overall cost of failing to detect the invalidation until after
> +	 * mmu_lock is acquired.
> +	 */
> +	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) {
> +		kvm_release_pfn_clean(fault->pfn);
> +		return RET_PF_RETRY;
> +	}
> +
>  	return RET_PF_CONTINUE;
>  }
>
> @@ -4442,6 +4479,11 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
>  	if (!sp && kvm_test_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
>  		return true;
>
> +	/*
> +	 * Check for a relevant mmu_notifier invalidation event one last time
> +	 * now that mmu_lock is held, as the "unsafe" checks performed without
> +	 * holding mmu_lock can get false negatives.
> +	 */
>  	return fault->slot &&
>  	       mmu_invalidate_retry_gfn(vcpu->kvm, fault->mmu_seq, fault->gfn);
>  }
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 18e28610749e..97afe4519772 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2062,6 +2062,32 @@ static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
>  		return 1;
>  	return 0;
>  }
> +
> +/*
> + * This lockless version of the range-based retry check *must* be paired with a
> + * call to the locked version after acquiring mmu_lock, i.e. this is safe to
> + * use only as a pre-check to avoid contending mmu_lock.  This version *will*
> + * get false negatives and false positives.
> + */
> +static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
> +						   unsigned long mmu_seq,
> +						   gfn_t gfn)
> +{
> +	/*
> +	 * Use READ_ONCE() to ensure the in-progress flag and sequence counter
> +	 * are always read from memory, e.g. so that checking for retry in a
> +	 * loop won't result in an infinite retry loop.  Don't force loads for
> +	 * start+end, as the key to avoiding infinite retry loops is observing
> +	 * the 1=>0 transition of in-progress, i.e. getting false negatives
> +	 * due to stale start+end values is acceptable.
> +	 */
> +	if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
> +	    gfn >= kvm->mmu_invalidate_range_start &&
> +	    gfn < kvm->mmu_invalidate_range_end)
> +		return true;
> +
> +	return READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq;
> +}
>  #endif
>
>  #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
>
> base-commit: 21dbc438dde69ff630b3264c54b94923ee9fcdcf
> --
> 2.44.0.rc0.258.g7320e95886-goog
>
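
For anyone skimming the thread, here is a condensed, hand-written sketch of the
ordering the patch establishes in kvm_faultin_pfn() (a summary of the diff
above, not the actual code; error handling and the noslot path are elided):

	/* Snapshot the sequence count before any checks (existing code). */
	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
	smp_rmb();

	/* 1) Lockless pre-check: bail before touching the primary MMU. */
	if (fault->slot &&
	    mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
		return RET_PF_RETRY;

	/* 2) Fault the pfn in via the primary MMU. */
	ret = __kvm_faultin_pfn(vcpu, fault);

	/* 3) Lockless re-check: put the pfn and bail before contending mmu_lock. */
	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) {
		kvm_release_pfn_clean(fault->pfn);
		return RET_PF_RETRY;
	}

	/* 4) The locked check in is_page_fault_stale() remains the authoritative one. */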