On Mon, Feb 05, 2024 at 10:27:32AM -0800, Sean Christopherson wrote:
> On Mon, Feb 05, 2024, Yan Zhao wrote:
> > On Fri, Feb 02, 2024 at 04:35:18PM -0800, Sean Christopherson wrote:
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 3c193b096b45..8ce9898914f1 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -4415,6 +4415,25 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > >  	if (unlikely(!fault->slot))
> > >  		return kvm_handle_noslot_fault(vcpu, fault, access);
> > >  
> > > +	/*
> > > +	 * Pre-check for a relevant mmu_notifier invalidation event prior to
> > > +	 * acquiring mmu_lock.  If there is an in-progress invalidation and the
> > > +	 * kernel allows preemption, the invalidation task may drop mmu_lock
> > > +	 * and yield in response to mmu_lock being contended, which is *very*
> > > +	 * counter-productive as this vCPU can't actually make forward progress
> > > +	 * until the invalidation completes.  This "unsafe" check can get false
> > > +	 * negatives, i.e. KVM needs to re-check after acquiring mmu_lock.
> > > +	 *
> > > +	 * Do the pre-check even for non-preemptible kernels, i.e. even if KVM
> > > +	 * will never yield mmu_lock in response to contention, as this vCPU is
> > > +	 * *guaranteed* to need to retry, i.e. waiting until mmu_lock is held
> > > +	 * to detect retry guarantees the worst case latency for the vCPU.
> > > +	 */
> > > +	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) {
> > > +		kvm_release_pfn_clean(fault->pfn);
> > > +		return RET_PF_RETRY;
> > > +	}
> > > +
> > Could we also add this pre-check before faulting in the pfn? Like:
> > 
> > @@ -4404,6 +4404,8 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > 
> >  	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> >  	smp_rmb();
> > +	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
> > +		return RET_PF_CONTINUE;
> > 
> >  	ret = __kvm_faultin_pfn(vcpu, fault);
> >  	if (ret != RET_PF_CONTINUE)
> > 
> > Though the mmu_seq would always be equal in the check, it can avoid repeated
> > faulting and releasing the pfn in vain during a long zap cycle.
> 
> Just to be super clear, by "repeated faulting", you mean repeated faulting in the
> primary MMU, correct?
> 
Yes. Faulting in the primary MMU.
(Please ignore my typo in the return type above :))

BTW, will you also send the optimization from v1, as below?

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1e340098d034..c7617991e290 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5725,11 +5725,13 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
 	}
 
 	if (r == RET_PF_INVALID) {
-		r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
-					  lower_32_bits(error_code), false,
-					  &emulation_type);
-		if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
-			return -EIO;
+		do {
+			r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
+						  lower_32_bits(error_code),
+						  false, &emulation_type);
+			if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
+				return -EIO;
+		} while (r == RET_PF_RETRY);
 	}
 
 	if (r < 0)

> Yeah, I agree, that makes sense.  The only question is whether to check before
> and after, or only before.  When abusing KSM, I see ~99.5% of all invalidations
> being detected before (21.5M out of just over 21.6M).
> 
> I think it makes sense to also check after getting the pfn?  The check is super
> cheap, especially when there isn't an invalidation event as the checks should all
> be quite predictable and cache hot.

I think checking both before and after is reasonable if the cost of the check is
not a concern.
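Just to make sure we mean the same thing, here is a rough, untested sketch of
kvm_faultin_pfn() with the unsafe check in both places, i.e. one before faulting
in the pfn from the primary MMU and one after (reusing the hunks above, with the
unrelated code in between elided):

	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
	smp_rmb();

	/*
	 * Bail before faulting in the pfn if an invalidation of this gfn is
	 * already in progress, as the vCPU is guaranteed to retry anyway and
	 * faulting in the pfn (and then releasing it) would be wasted work.
	 */
	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
		return RET_PF_RETRY;

	ret = __kvm_faultin_pfn(vcpu, fault);
	if (ret != RET_PF_CONTINUE)
		return ret;

	...

	/*
	 * Re-do the unsafe check after getting the pfn, as an invalidation
	 * may have started while the pfn was being faulted in.  The locked
	 * re-check after acquiring mmu_lock still has the final say.
	 */
	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) {
		kvm_release_pfn_clean(fault->pfn);
		return RET_PF_RETRY;
	}
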
> > > +/*
> > > + * This lockless version of the range-based retry check *must* be paired with a
> > > + * call to the locked version after acquiring mmu_lock, i.e. this is safe to
> > > + * use only as a pre-check to avoid contending mmu_lock.  This version *will*
> > > + * get false negatives and false positives.
> > > + */
> > > +static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
> > > +						   unsigned long mmu_seq,
> > > +						   gfn_t gfn)
> > > +{
> > > +	/*
> > > +	 * Use READ_ONCE() to ensure the in-progress flag and sequence counter
> > > +	 * are always read from memory, e.g. so that checking for retry in a
> > > +	 * loop won't result in an infinite retry loop.  Don't force loads for
> > > +	 * start+end, as the key to avoiding infinite retry loops is observing
> > > +	 * the 1=>0 transition of in-progress, i.e. getting false negatives
> > > +	 * due to stale start+end values is acceptable.
> > > +	 */
> > > +	if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
> > > +	    gfn >= kvm->mmu_invalidate_range_start &&
> > > +	    gfn < kvm->mmu_invalidate_range_end)
> > > +		return true;
> > > +
> > Is a smp_rmb() before the below check better, given this retry check is defined
> > in a header for all platforms?
> 
> No, because the unsafe check very deliberately doesn't provide any ordering
> guarantees.  The READ_ONCE() is purely to ensure forward progress.  And trying to
> provide ordering for mmu_invalidate_in_progress is rather nonsensical.  The smp_rmb()
> in mmu_invalidate_retry() ensures the caller will see a different mmu_invalidate_seq
> or an elevated mmu_invalidate_in_progress.
> 
> For this code, the order in which mmu_invalidate_seq and mmu_invalidate_in_progress
> are checked truly doesn't matter as the range-based checks will get false
> negatives when performed outside of mmu_lock.  And from a correctness perspective,
> there are zero constraints on when the checks are performed (beyond the existing
> constraints from the earlier smp_rmb() and acquiring mmu_lock).
> 
> If an arch wants ordering guarantees, it can simply use mmu_invalidate_retry()
> without a gfn, which has the advantage of being safe outside of mmu_lock (the
> obvious disadvantage is that it's very imprecise).

Ok. It makes sense!
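Just to record my understanding of the intended pairing, roughly (a sketch only,
not the literal x86 code; the mmu_seq snapshot and pfn handling are elided, and
I'm assuming the locked counterpart is mmu_invalidate_retry_gfn() and that
mmu_lock is taken for write):

	/*
	 * Lockless, range-based pre-check: no ordering guarantees, false
	 * negatives/positives are acceptable, it's purely an optimization to
	 * avoid contending mmu_lock when a retry is all but guaranteed.  An
	 * arch that wants ordering guarantees can use mmu_invalidate_retry()
	 * here instead (safe outside mmu_lock, but far less precise).
	 */
	if (mmu_invalidate_retry_gfn_unsafe(kvm, mmu_seq, gfn))
		return RET_PF_RETRY;

	write_lock(&kvm->mmu_lock);

	/* Authoritative re-check, now that mmu_lock is held. */
	if (mmu_invalidate_retry_gfn(kvm, mmu_seq, gfn)) {
		write_unlock(&kvm->mmu_lock);
		return RET_PF_RETRY;
	}

	/* ... install the mapping ... */

	write_unlock(&kvm->mmu_lock);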