On Thu, 2023-08-24 at 19:07 -0700, Sean Christopherson wrote:
> Retry page faults without acquiring mmu_lock if the resolved hva is covered
> by an active invalidation. Contending for mmu_lock is especially
> problematic on preemptible kernels as the mmu_notifier invalidation task
> will yield mmu_lock (see rwlock_needbreak()), delay the in-progress
> invalidation, and ultimately increase the latency of resolving the page
> fault. And in the worst case scenario, yielding will be accompanied by a
> remote TLB flush, e.g. if the invalidation covers a large range of memory
> and vCPUs are accessing addresses that were already zapped.
>
> Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
> iterators to perform more work before yielding, but that wouldn't solve
> the lock contention and would negatively affect scenarios where a vCPU is
> trying to fault in an address that is NOT covered by the in-progress
> invalidation.
>
> Reported-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>
> Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@xxxxxxxxxxxxxxxxxxxxxxxxx
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>

Acked-by: Kai Huang <kai.huang@xxxxxxxxx>

Nit below ...

> ---
>  arch/x86/kvm/mmu/mmu.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 1a5a1e7d1eb7..8e2e07ed1a1b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4334,6 +4334,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  	if (unlikely(!fault->slot))
>  		return kvm_handle_noslot_fault(vcpu, fault, access);
>
> +	if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
> +		return RET_PF_RETRY;
> +

...

Perhaps a comment saying this is to avoid unnecessary MMU lock contention
would be nice. Otherwise we have is_page_fault_stale() called later under
the MMU lock. I suppose people only tend to use git blame when they cannot
find the answer in the code :-)

>  	return RET_PF_CONTINUE;
>  }
>

Btw, currently fault->mmu_seq is set in kvm_faultin_pfn(), which happens
after fast_page_fault(). Conceptually, should we move this even earlier,
to before fast_page_fault(), because I assume the range zapping should
also apply to the cases that fast_page_fault() handles?
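
For the nit above, something like the below is roughly what I had in mind,
just a sketch, the exact wording is of course up to you:

	/*
	 * Retry the fault, without acquiring mmu_lock, if the hva is
	 * covered by an in-progress invalidation, to avoid unnecessary
	 * lock contention.  This is purely an optimization; correctness
	 * is still provided by is_page_fault_stale(), which re-checks
	 * under mmu_lock.
	 */
	if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
		return RET_PF_RETRY;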
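
And to illustrate the mmu_seq question, I was thinking of something like
the below (completely untested, and the context lines are from memory so
may not match the tree exactly; the existing snapshot in kvm_faultin_pfn()
would need to be dropped accordingly):

	static int direct_page_fault(struct kvm_vcpu *vcpu,
				     struct kvm_page_fault *fault)
	{
		...
		/*
		 * Snapshot the invalidation sequence count before the fast
		 * path, so that a zap racing with fast_page_fault() is also
		 * observed by the later retry checks.  The smp_rmb() keeps
		 * the ordering we have today in kvm_faultin_pfn().
		 */
		fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
		smp_rmb();

		r = fast_page_fault(vcpu, fault);
		if (r != RET_PF_INVALID)
			return r;
		...
	}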