Re: [PATCH] KVM: Move gfn_to_pfn_cache invalidation to invalidate_range_end hook

On Mon, Aug 05, 2024, David Woodhouse wrote:
> From: David Woodhouse <dwmw@xxxxxxxxxxxx>
> 
> The existing retry loop in hva_to_pfn_retry() is extremely pessimistic.
> If there is an invalidation running concurrently, it is effectively just
> a complex busy wait loop because its local mmu_notifier_retry_cache()
> function will always return true.
> 
> It ends up functioning as a very unfair read/write lock. If userspace is
> acting as a 'writer', performing many unrelated MM changes, then the
> hva_to_pfn_retry() function acting as the 'reader' just backs off and
> keeps retrying forever, not making any progress.
> 
> Solve this by introducing a separate 'validating' flag to the GPC, so
> that it can be marked invalid before it's even mapped. This allows the
> invalidation to be moved to the range_end hook, and the retry loop in
> hva_to_pfn_retry() can be changed to loop only if its particular uHVA
> has been affected.

I think I'm missing something.  How does allowing hva_to_pfn_retry() to complete allow KVM
as a whole to make forward progress?  Doesn't getting past hva_to_pfn_retry()
just move the problem to kvm_gpc_check()?

kvm_gpc_refresh() can't be called with gpc->lock held, and nor does it return
with gpc->lock held, so a racing mmu_notifier invalidation can/will acquire
gpc->lock and clear gpc->valid, no?
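
To spell it out, the typical caller-side pattern is roughly this (a simplified
sketch, not copied from any one caller; PAGE_SIZE stands in for whatever length
the caller actually passes):

	unsigned long flags;

	read_lock_irqsave(&gpc->lock, flags);
	while (!kvm_gpc_check(gpc, PAGE_SIZE)) {
		read_unlock_irqrestore(&gpc->lock, flags);

		/*
		 * The refresh runs without gpc->lock held, so an mmu_notifier
		 * invalidation can slip in and invalidate the cache again
		 * before the lock is re-taken and kvm_gpc_check() re-runs.
		 */
		if (kvm_gpc_refresh(gpc, PAGE_SIZE))
			return;

		read_lock_irqsave(&gpc->lock, flags);
	}

	/* ... access gpc->khva while holding gpc->lock for read ... */

	read_unlock_irqrestore(&gpc->lock, flags);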

Oh, by "unrelated", you mean _completely_ unrelated?  As in, KVM happens to do a
refresh when userspace is blasting MADV_DONTNEED, and gets stuck retrying for
no good reason?

Servicing guest pages faults has the same problem, which is why
mmu_invalidate_retry_gfn() was added.  Supporting hva-only GPCs made our lives a
little harder, but not horrifically so (there are ordering differences regardless).
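
Roughly, the fault path snapshots the invalidation sequence count before
resolving the pfn and then re-checks it under mmu_lock (sketch only; "kvm" and
"gfn" come from the fault context):

	unsigned long mmu_seq;

retry:
	mmu_seq = kvm->mmu_invalidate_seq;
	smp_rmb();

	/* Resolve gfn => pfn without holding mmu_lock; this may fault and sleep. */

	write_lock(&kvm->mmu_lock);
	if (mmu_invalidate_retry_gfn(kvm, mmu_seq, gfn)) {
		/* Retry only if an invalidation actually overlapped this gfn. */
		write_unlock(&kvm->mmu_lock);
		goto retry;
	}

	/* ... install the mapping ... */
	write_unlock(&kvm->mmu_lock);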

Woefully incomplete, but I think this is the gist of what you want:

diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index f0039efb9e1e..1c4c95ab7d0a 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -28,6 +28,26 @@ void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start,
        struct gfn_to_pfn_cache *gpc;
 
        spin_lock(&kvm->gpc_lock);
+
+       if (likely(kvm_is_error_hva(kvm->mmu_gpc_invalidate_range_start))) {
+               kvm->mmu_gpc_invalidate_range_start = start;
+               kvm->mmu_gpc_invalidate_range_end = end;
+       } else {
+               /*
+                * Fully tracking multiple concurrent ranges has diminishing
+                * returns. Keep things simple and just find the minimal range
+                * which includes the current and new ranges. As there won't be
+                * enough information to subtract a range after its invalidate
+                * completes, any ranges invalidated concurrently will
+                * accumulate and persist until all outstanding invalidates
+                * complete.
+                */
+               kvm->mmu_gpc_invalidate_range_start =
+                       min(kvm->mmu_gpc_invalidate_range_start, start);
+               kvm->mmu_gpc_invalidate_range_end =
+                       max(kvm->mmu_gpc_invalidate_range_end, end);
+       }
+
        list_for_each_entry(gpc, &kvm->gpc_list, list) {
                read_lock_irq(&gpc->lock);
 
@@ -124,8 +144,11 @@ static void gpc_unmap(kvm_pfn_t pfn, void *khva)
 #endif
 }
 
-static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_seq)
+static inline bool mmu_notifier_retry_cache(struct gfn_to_pfn_cache *gpc,
+                                           unsigned long mmu_seq)
 {
+       struct kvm *kvm = gpc->kvm;
+
        /*
         * mn_active_invalidate_count acts for all intents and purposes
         * like mmu_invalidate_in_progress here; but the latter cannot
@@ -138,7 +161,9 @@ static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_s
         * be elevated before the mmu_notifier acquires gpc->lock, and
         * isn't dropped until after mmu_invalidate_seq is updated.
         */
-       if (kvm->mn_active_invalidate_count)
+       if (kvm->mn_active_invalidate_count &&
+           gpc->uhva >= kvm->mmu_gpc_invalidate_range_start &&
+           gpc->uhva < kvm->mmu_gpc_invalidate_range_end)
                return true;
 
        /*
@@ -224,7 +249,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
                 * attempting to refresh.
                 */
                WARN_ON_ONCE(gpc->valid);
-       } while (mmu_notifier_retry_cache(gpc->kvm, mmu_seq));
+       } while (mmu_notifier_retry_cache(gpc, mmu_seq));
 
        gpc->valid = true;
        gpc->pfn = new_pfn;

> Signed-off-by: David Woodhouse <dwmw@xxxxxxxxxxxx>
> ---
> I note I'm deleting a big comment in kvm_main.c about doing the
> invalidation before acquiring mmu_lock. But we don't hold the lock
> in the range_end callback either, do we?

Correct, __kvm_handle_hva_range() acquires and releases mmu_lock.  However, the
intent of the comment was to clarify why GPCs are invalidated in
kvm_mmu_notifier_invalidate_range_start(), as opposed to kvm_mmu_invalidate_begin()
which _is_ called under mmu_lock and is also called if and only if KVM has a
relevant memslot.  E.g. that's why the comment also talks about memslot overlap
checks.

>  
>  include/linux/kvm_types.h |  1 +
>  virt/kvm/kvm_main.c       | 14 ++------
>  virt/kvm/kvm_mm.h         | 12 +++----
>  virt/kvm/pfncache.c       | 75 +++++++++++++++++++--------------------
>  4 files changed, 45 insertions(+), 57 deletions(-)
> 
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 827ecc0b7e10..30ed1019cfc6 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -69,6 +69,7 @@ struct gfn_to_pfn_cache {
>  	void *khva;
>  	kvm_pfn_t pfn;
>  	bool active;
> +	bool validating;

This is a confusing name, partly because KVM usually deals with invalidation
events, but also because it's sticky and stays set long after the act of
validating the GPC is complete.

Something like "needs_invalidation" is the best I can come up with, but I believe
this bikeshed is moot (see above and below).

>  	bool valid;
>  };
>  
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d0788d0a72cc..ffd6ab4c2a16 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -777,18 +777,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  	kvm->mn_active_invalidate_count++;
>  	spin_unlock(&kvm->mn_invalidate_lock);
>  
> -	/*
> -	 * Invalidate pfn caches _before_ invalidating the secondary MMUs, i.e.
> -	 * before acquiring mmu_lock, to avoid holding mmu_lock while acquiring
> -	 * each cache's lock.  There are relatively few caches in existence at
> -	 * any given time, and the caches themselves can check for hva overlap,
> -	 * i.e. don't need to rely on memslot overlap checks for performance.
> -	 * Because this runs without holding mmu_lock, the pfn caches must use
> -	 * mn_active_invalidate_count (see above) instead of
> -	 * mmu_invalidate_in_progress.
> -	 */
> -	gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end);
> -
>  	/*
>  	 * If one or more memslots were found and thus zapped, notify arch code
>  	 * that guest memory has been reclaimed.  This needs to be done *after*
> @@ -849,6 +837,8 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
>  	wake = !kvm->mn_active_invalidate_count;
>  	spin_unlock(&kvm->mn_invalidate_lock);
>  
> +	gfn_to_pfn_cache_invalidate(kvm, range->start, range->end);

We can't do this.  The contract with mmu_notifiers is that secondary MMUs must
unmap the hva before returning from invalidate_range_start(), and must not create
new mappings until invalidate_range_end().
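
I.e. any secondary MMU has to be structured along these lines (a sketch with
made-up my_mmu/my_unmap names, purely to illustrate the contract, not KVM's
actual implementation):

struct my_mmu {
	struct mmu_notifier mn;
	spinlock_t lock;
	int invalidate_in_progress;
};

static int my_invalidate_range_start(struct mmu_notifier *mn,
				     const struct mmu_notifier_range *range)
{
	struct my_mmu *mmu = container_of(mn, struct my_mmu, mn);

	/* Block new mappings before tearing down the existing ones. */
	spin_lock(&mmu->lock);
	mmu->invalidate_in_progress++;
	spin_unlock(&mmu->lock);

	/* Translations for [start, end) must be gone before returning. */
	my_unmap(mmu, range->start, range->end);
	return 0;
}

static void my_invalidate_range_end(struct mmu_notifier *mn,
				    const struct mmu_notifier_range *range)
{
	struct my_mmu *mmu = container_of(mn, struct my_mmu, mn);

	/* Only now may the secondary MMU create new mappings again. */
	spin_lock(&mmu->lock);
	mmu->invalidate_in_progress--;
	spin_unlock(&mmu->lock);
}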
