On Thu, 2023-08-24 at 19:07 -0700, Sean Christopherson wrote:
> Allow checking mmu_invalidate_retry_hva() without holding mmu_lock, even
> though mmu_lock must be held to guarantee correctness, i.e. to avoid
> false negatives.  Dropping the requirement that mmu_lock be held will
> allow pre-checking for retry before acquiring mmu_lock, e.g. to avoid
> contending mmu_lock when the guest is accessing a range that is being
> invalidated by the host.
>
> Contending mmu_lock can have severe negative side effects for x86's TDP
> MMU when running on preemptible kernels, as KVM will yield from the
> zapping task (holds mmu_lock for write) when there is lock contention,
> and yielding after any SPTEs have been zapped requires a VM-scoped TLB
> flush.
>
> Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that calling
> mmu_invalidate_retry_hva() in a loop won't put KVM into an infinite loop,
> e.g. due to caching the in-progress flag and never seeing it go to '0'.
>
> Force a load of mmu_invalidate_seq as well, even though it isn't strictly
> necessary to avoid an infinite loop, as doing so improves the probability
> that KVM will detect an invalidation that already completed before
> acquiring mmu_lock and bailing anyways.

Without the READ_ONCE() on mmu_invalidate_seq, with patch 2 applied and
mmu_invalidate_retry_hva() inlined, IIUC kvm_faultin_pfn() can look like
this:

	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;		<-- (1)
	smp_rmb();
	...
	READ_ONCE(vcpu->kvm->mmu_invalidate_in_progress);
	...
	if (vcpu->kvm->mmu_invalidate_seq != fault->mmu_seq)	<-- (2)
	...

Perhaps a stupid question :-)  Will the compiler believe that both
vcpu->kvm->mmu_invalidate_seq and fault->mmu_seq never change, and thus
eliminate the code at (1) and (2)?  Or are all the barriers in between
enough to prevent the compiler from doing such a thing?
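To make the caching concern concrete, here is a minimal sketch (hypothetical
names, not the actual KVM code) of the failure mode the changelog describes
for the in-progress flag, i.e. why a plain load in a retry loop is allowed
to spin forever while a READ_ONCE() load is not:

	#include <linux/compiler.h>

	/*
	 * BROKEN: with a plain load and no compiler barrier in the loop
	 * body, the compiler may load *seq once, cache the value in a
	 * register, and never observe an update made by another CPU.
	 */
	static void wait_for_update_broken(unsigned long *seq,
					   unsigned long snap)
	{
		while (*seq == snap)
			;
	}

	/*
	 * OK: READ_ONCE() makes the access volatile, so the compiler must
	 * emit a fresh load from memory on every iteration.
	 */
	static void wait_for_update(unsigned long *seq, unsigned long snap)
	{
		while (READ_ONCE(*seq) == snap)
			;
	}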
> Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE()
> may generate a load into a register instead of doing a direct comparison
> (MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost
> is a few bytes of code and maaaaybe a cycle or three.
>
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
>  include/linux/kvm_host.h | 17 ++++++++++++++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 7418e881c21c..7314138ba5f4 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1962,18 +1962,29 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
>  					   unsigned long mmu_seq,
>  					   unsigned long hva)
>  {
> -	lockdep_assert_held(&kvm->mmu_lock);
>  	/*
>  	 * If mmu_invalidate_in_progress is non-zero, then the range maintained
>  	 * by kvm_mmu_notifier_invalidate_range_start contains all addresses
>  	 * that might be being invalidated.  Note that it may include some false
>  	 * positives, due to shortcuts when handling concurrent invalidations.
> +	 *
> +	 * Note the lack of a memory barrier!  The caller *must* hold mmu_lock
> +	 * to avoid false negatives!  Holding mmu_lock is not mandatory though,
> +	 * e.g. to allow pre-checking for an in-progress invalidation to
> +	 * avoid contending mmu_lock.  Ensure that the in-progress flag and
> +	 * sequence counter are always read from memory, so that checking for
> +	 * retry in a loop won't result in an infinite retry loop.  Don't force
> +	 * loads for start+end, as the key to avoiding an infinite retry loop
> +	 * is observing the 1=>0 transition of in-progress, i.e. getting false
> +	 * negatives (if mmu_lock isn't held) due to stale start+end values is
> +	 * acceptable.
>  	 */
> -	if (unlikely(kvm->mmu_invalidate_in_progress) &&
> +	if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
>  	    hva >= kvm->mmu_invalidate_range_start &&
>  	    hva < kvm->mmu_invalidate_range_end)
>  		return 1;
> -	if (kvm->mmu_invalidate_seq != mmu_seq)
> +
> +	if (READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq)
>  		return 1;
>  	return 0;
>  }

I am not sure how mmu_invalidate_retry_hva() can be called in a loop, so
all of this should be theoretical, but as you said the extra cost should
be nearly zero.

Acked-by: Kai Huang <kai.huang@xxxxxxxxx>

(Or perhaps patches 1 and 2 could just be merged into one patch.)
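P.S. purely for illustration, a rough sketch of the lockless pre-check
this change enables (a hypothetical helper, not the code from patch 2;
the struct kvm_page_fault fields and RET_PF_* values are as in x86's
mmu_internal.h).  The point is that a false negative here is harmless,
because the authoritative check is still repeated under mmu_lock:

	/*
	 * Hypothetical sketch.  fault->mmu_seq is snapshotted before the
	 * pfn lookup, with smp_rmb() pairing against the invalidation side.
	 */
	static int faultin_precheck(struct kvm_vcpu *vcpu,
				    struct kvm_page_fault *fault)
	{
		/*
		 * mmu_lock is NOT held, so this can race and miss an
		 * in-progress invalidation (a false negative); correctness
		 * still comes from re-checking under mmu_lock.  A hit here
		 * merely avoids acquiring (and contending) mmu_lock for a
		 * fault that would be retried anyway.
		 */
		if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq,
					     fault->hva))
			return RET_PF_RETRY;

		return RET_PF_CONTINUE;
	}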