On Mon, Feb 05, 2024 at 10:27:32AM -0800, Sean Christopherson wrote:
> On Mon, Feb 05, 2024, Yan Zhao wrote:
> > On Fri, Feb 02, 2024 at 04:35:18PM -0800, Sean Christopherson wrote:
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 3c193b096b45..8ce9898914f1 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -4415,6 +4415,25 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > >  	if (unlikely(!fault->slot))
> > >  		return kvm_handle_noslot_fault(vcpu, fault, access);
> > >  
> > > +	/*
> > > +	 * Pre-check for a relevant mmu_notifier invalidation event prior to
> > > +	 * acquiring mmu_lock.  If there is an in-progress invalidation and the
> > > +	 * kernel allows preemption, the invalidation task may drop mmu_lock
> > > +	 * and yield in response to mmu_lock being contended, which is *very*
> > > +	 * counter-productive as this vCPU can't actually make forward progress
> > > +	 * until the invalidation completes.  This "unsafe" check can get false
> > > +	 * negatives, i.e. KVM needs to re-check after acquiring mmu_lock.
> > > +	 *
> > > +	 * Do the pre-check even for non-preemptible kernels, i.e. even if KVM
> > > +	 * will never yield mmu_lock in response to contention, as this vCPU is
> > > +	 * *guaranteed* to need to retry, i.e. waiting until mmu_lock is held
> > > +	 * to detect retry guarantees the worst case latency for the vCPU.
> > > +	 */
> > > +	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) {
> > > +		kvm_release_pfn_clean(fault->pfn);
> > > +		return RET_PF_RETRY;
> > > +	}
> > > +
> > Could we also add this pre-check before faulting in the pfn? Like:
> > 
> > @@ -4404,6 +4404,8 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > 
> >  	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> >  	smp_rmb();
> > +	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
> > +		return RET_PF_CONTINUE;
> > 
> >  	ret = __kvm_faultin_pfn(vcpu, fault);
> >  	if (ret != RET_PF_CONTINUE)
> > 
> > Though the mmu_seq would always be equal in the check, it can avoid repeated
> > faulting and releasing the pfn in vain during a long zap cycle.
> 
> Just to be super clear, by "repeated faulting", you mean repeated faulting in the
> primary MMU, correct?
> 
Yes. Faulting in the primary MMU.
(Please ignore my typo in the return type above :))

BTW, will you also send the optimization from v1, as below?

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1e340098d034..c7617991e290 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5725,11 +5725,13 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
 	}
 
 	if (r == RET_PF_INVALID) {
-		r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
-					  lower_32_bits(error_code), false,
-					  &emulation_type);
-		if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
-			return -EIO;
+		do {
+			r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
+						  lower_32_bits(error_code),
+						  false, &emulation_type);
+			if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
+				return -EIO;
+		} while (r == RET_PF_RETRY);
 	}
 
 	if (r < 0)

> Yeah, I agree, that makes sense.  The only question is whether to check before
> and after, or only before.  When abusing KSM, I see ~99.5% of all invalidations
> being detected before (21.5M out of just over 21.6M).
> 
> I think it makes sense to also check after getting the pfn?  The check is super
> cheap, especially when there isn't an invalidation event as the checks should all
> be quite predictable and cache hot.

I think checking both before and after is reasonable if the cost of the check is
not a concern.
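Just to make sure we mean the same thing, here is a rough, untested sketch of
kvm_faultin_pfn() with the unsafe check in both places, i.e. one before faulting
in the pfn from the primary MMU and one after (reusing the hunks above, with the
unrelated code in between elided):

	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
	smp_rmb();

	/*
	 * Bail before faulting in the pfn if an invalidation of this gfn is
	 * already in progress, as the vCPU is guaranteed to retry anyway and
	 * faulting in the pfn (and then releasing it) would be wasted work.
	 */
	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
		return RET_PF_RETRY;

	ret = __kvm_faultin_pfn(vcpu, fault);
	if (ret != RET_PF_CONTINUE)
		return ret;

	...

	/*
	 * Re-do the unsafe check after getting the pfn, as an invalidation
	 * may have started while the pfn was being faulted in.  The locked
	 * re-check after acquiring mmu_lock still has the final say.
	 */
	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) {
		kvm_release_pfn_clean(fault->pfn);
		return RET_PF_RETRY;
	}
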
> > > +/*
> > > + * This lockless version of the range-based retry check *must* be paired with a
> > > + * call to the locked version after acquiring mmu_lock, i.e. this is safe to
> > > + * use only as a pre-check to avoid contending mmu_lock.  This version *will*
> > > + * get false negatives and false positives.
> > > + */
> > > +static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
> > > +						   unsigned long mmu_seq,
> > > +						   gfn_t gfn)
> > > +{
> > > +	/*
> > > +	 * Use READ_ONCE() to ensure the in-progress flag and sequence counter
> > > +	 * are always read from memory, e.g. so that checking for retry in a
> > > +	 * loop won't result in an infinite retry loop.  Don't force loads for
> > > +	 * start+end, as the key to avoiding infinite retry loops is observing
> > > +	 * the 1=>0 transition of in-progress, i.e. getting false negatives
> > > +	 * due to stale start+end values is acceptable.
> > > +	 */
> > > +	if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
> > > +	    gfn >= kvm->mmu_invalidate_range_start &&
> > > +	    gfn < kvm->mmu_invalidate_range_end)
> > > +		return true;
> > > +
> > Is a smp_rmb() before the below check better, given this retry check is defined
> > in a header for all platforms?
> 
> No, because the unsafe check very deliberately doesn't provide any ordering
> guarantees.  The READ_ONCE() is purely to ensure forward progress.  And trying to
> provide ordering for mmu_invalidate_in_progress is rather nonsensical.  The smp_rmb()
> in mmu_invalidate_retry() ensures the caller will see a different mmu_invalidate_seq
> or an elevated mmu_invalidate_in_progress.
> 
> For this code, the order in which mmu_invalidate_seq and mmu_invalidate_in_progress
> are checked truly doesn't matter as the range-based checks will get false
> negatives when performed outside of mmu_lock.  And from a correctness perspective,
> there are zero constraints on when the checks are performed (beyond the existing
> constraints from the earlier smp_rmb() and acquiring mmu_lock).
> 
> If an arch wants ordering guarantees, it can simply use mmu_invalidate_retry()
> without a gfn, which has the advantage of being safe outside of mmu_lock (the
> obvious disadvantage is that it's very imprecise).

Ok. It makes sense!
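Just to record my understanding of the intended pairing, roughly (a sketch only,
not the literal x86 code; the mmu_seq snapshot and pfn handling are elided, and
I'm assuming the locked counterpart is mmu_invalidate_retry_gfn() and that
mmu_lock is taken for write):

	/*
	 * Lockless, range-based pre-check: no ordering guarantees, false
	 * negatives/positives are acceptable, it's purely an optimization to
	 * avoid contending mmu_lock when a retry is all but guaranteed.  An
	 * arch that wants ordering guarantees can use mmu_invalidate_retry()
	 * here instead (safe outside mmu_lock, but far less precise).
	 */
	if (mmu_invalidate_retry_gfn_unsafe(kvm, mmu_seq, gfn))
		return RET_PF_RETRY;

	write_lock(&kvm->mmu_lock);

	/* Authoritative re-check, now that mmu_lock is held. */
	if (mmu_invalidate_retry_gfn(kvm, mmu_seq, gfn)) {
		write_unlock(&kvm->mmu_lock);
		return RET_PF_RETRY;
	}

	/* ... install the mapping ... */

	write_unlock(&kvm->mmu_lock);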