On Thu, 2023-08-24 at 19:07 -0700, Sean Christopherson wrote:
> Allow checking mmu_invalidate_retry_hva() without holding mmu_lock, even
> though mmu_lock must be held to guarantee correctness, i.e. to avoid
> false negatives.  Dropping the requirement that mmu_lock be held will
> allow pre-checking for retry before acquiring mmu_lock, e.g. to avoid
> contending mmu_lock when the guest is accessing a range that is being
> invalidated by the host.
>
> Contending mmu_lock can have severe negative side effects for x86's TDP
> MMU when running on preemptible kernels, as KVM will yield from the
> zapping task (holds mmu_lock for write) when there is lock contention,
> and yielding after any SPTEs have been zapped requires a VM-scoped TLB
> flush.
>
> Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that calling
> mmu_invalidate_retry_hva() in a loop won't put KVM into an infinite loop,
> e.g. due to caching the in-progress flag and never seeing it go to '0'.
>
> Force a load of mmu_invalidate_seq as well, even though it isn't strictly
> necessary to avoid an infinite loop, as doing so improves the probability
> that KVM will detect an invalidation that already completed before
> acquiring mmu_lock and bailing anyways.

Without the READ_ONCE() on mmu_invalidate_seq, with patch 2 applied and
mmu_invalidate_retry_hva() inlined, IIUC kvm_faultin_pfn() can look like
this:

	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;		<-- (1)
	smp_rmb();
	...
	READ_ONCE(vcpu->kvm->mmu_invalidate_in_progress);
	...
	if (vcpu->kvm->mmu_invalidate_seq != fault->mmu_seq)	<-- (2)
	...

Perhaps a stupid question :-)  Will the compiler believe that both
vcpu->kvm->mmu_invalidate_seq and fault->mmu_seq never change, and thus
eliminate the code at (1) and (2)?  Or are all the barriers in between
enough to prevent the compiler from doing such a thing?
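To make the caching concern concrete, here is a minimal sketch (hypothetical
names, not the actual KVM code) of the failure mode the changelog describes
for the in-progress flag, i.e. why a plain load in a retry loop is allowed
to spin forever while a READ_ONCE() load is not:

	#include <linux/compiler.h>

	/*
	 * BROKEN: with a plain load and no compiler barrier in the loop
	 * body, the compiler may load *seq once, cache the value in a
	 * register, and never observe an update made by another CPU.
	 */
	static void wait_for_update_broken(unsigned long *seq,
					   unsigned long snap)
	{
		while (*seq == snap)
			;
	}

	/*
	 * OK: READ_ONCE() makes the access volatile, so the compiler must
	 * emit a fresh load from memory on every iteration.
	 */
	static void wait_for_update(unsigned long *seq, unsigned long snap)
	{
		while (READ_ONCE(*seq) == snap)
			;
	}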
> Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE()
> may generate a load into a register instead of doing a direct comparison
> (MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost
> is a few bytes of code and maaaaybe a cycle or three.
>
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
>  include/linux/kvm_host.h | 17 ++++++++++++++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 7418e881c21c..7314138ba5f4 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1962,18 +1962,29 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
>  					   unsigned long mmu_seq,
>  					   unsigned long hva)
>  {
> -	lockdep_assert_held(&kvm->mmu_lock);
>  	/*
>  	 * If mmu_invalidate_in_progress is non-zero, then the range maintained
>  	 * by kvm_mmu_notifier_invalidate_range_start contains all addresses
>  	 * that might be being invalidated.  Note that it may include some false
>  	 * positives, due to shortcuts when handling concurrent invalidations.
> +	 *
> +	 * Note the lack of a memory barrier!  The caller *must* hold mmu_lock
> +	 * to avoid false negatives!  Holding mmu_lock is not mandatory though,
> +	 * e.g. to allow pre-checking for an in-progress invalidation to
> +	 * avoid contending mmu_lock.  Ensure that the in-progress flag and
> +	 * sequence counter are always read from memory, so that checking for
> +	 * retry in a loop won't result in an infinite retry loop.  Don't force
> +	 * loads for start+end, as the key to avoiding an infinite retry loop
> +	 * is observing the 1=>0 transition of in-progress, i.e. getting false
> +	 * negatives (if mmu_lock isn't held) due to stale start+end values is
> +	 * acceptable.
>  	 */
> -	if (unlikely(kvm->mmu_invalidate_in_progress) &&
> +	if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
>  	    hva >= kvm->mmu_invalidate_range_start &&
>  	    hva < kvm->mmu_invalidate_range_end)
>  		return 1;
> -	if (kvm->mmu_invalidate_seq != mmu_seq)
> +
> +	if (READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq)
>  		return 1;
>  	return 0;
>  }

I am not sure how mmu_invalidate_retry_hva() can be called in a loop, so
all of this should be theoretical, but as you said the extra cost should
be nearly zero.

Acked-by: Kai Huang <kai.huang@xxxxxxxxx>

(Or perhaps patches 1 and 2 could just be merged into one patch.)
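P.S. purely for illustration, a rough sketch of the lockless pre-check
this change enables (a hypothetical helper, not the code from patch 2;
the struct kvm_page_fault fields and RET_PF_* values are as in x86's
mmu_internal.h).  The point is that a false negative here is harmless,
because the authoritative check is still repeated under mmu_lock:

	/*
	 * Hypothetical sketch.  fault->mmu_seq is snapshotted before the
	 * pfn lookup, with smp_rmb() pairing against the invalidation side.
	 */
	static int faultin_precheck(struct kvm_vcpu *vcpu,
				    struct kvm_page_fault *fault)
	{
		/*
		 * mmu_lock is NOT held, so this can race and miss an
		 * in-progress invalidation (a false negative); correctness
		 * still comes from re-checking under mmu_lock.  A hit here
		 * merely avoids acquiring (and contending) mmu_lock for a
		 * fault that would be retried anyway.
		 */
		if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq,
					     fault->hva))
			return RET_PF_RETRY;

		return RET_PF_CONTINUE;
	}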