On Wed, 2022-04-27 at 01:40 +0000, Sean Christopherson wrote:
> Rework the gfn=>pfn cache (gpc) refresh logic to address multiple races
> between the cache itself, and between the cache and mmu_notifier events.
>
> The existing refresh code attempts to guard against races with the
> mmu_notifier by speculatively marking the cache valid, and then marking
> it invalid if a mmu_notifier invalidation occurs. That handles the case
> where an invalidation occurs between dropping and re-acquiring gpc->lock,
> but it doesn't handle the scenario where the cache is refreshed after the
> cache was invalidated by the notifier, but before the notifier elevates
> mmu_notifier_count. The gpc refresh can't use the "retry" helper as its
> invalidation occurs _before_ mmu_notifier_count is elevated and before
> mmu_notifier_range_start is set/updated.
>
>                   CPU0                                    CPU1
>                   ----                                    ----
>
>   gfn_to_pfn_cache_invalidate_start()
>   |
>   -> gpc->valid = false;
>                                           kvm_gfn_to_pfn_cache_refresh()
>                                           |
>                                           |-> gpc->valid = true;
>
>                                           hva_to_pfn_retry()
>                                           |
>                                           -> acquire kvm->mmu_lock
>                                              kvm->mmu_notifier_count == 0
>                                              mmu_seq == kvm->mmu_notifier_seq
>                                              drop kvm->mmu_lock
>                                              return pfn 'X'
>   acquire kvm->mmu_lock
>   kvm_inc_notifier_count()
>   drop kvm->mmu_lock()
>   kernel frees pfn 'X'
>                                           kvm_gfn_to_pfn_cache_check()
>                                           |
>                                           |-> gpc->valid == true
>
>                                           caller accesses freed pfn 'X'
>
> Key off of mn_active_invalidate_count to detect that a pfncache refresh
> needs to wait for an in-progress mmu_notifier invalidation. While
> mn_active_invalidate_count is not guaranteed to be stable, it is
> guaranteed to be elevated prior to an invalidation acquiring gpc->lock,
> so either the refresh will see an active invalidation and wait, or the
> invalidation will run after the refresh completes.
>
> Speculatively marking the cache valid is itself flawed, as a concurrent
> kvm_gfn_to_pfn_cache_check() would see a valid cache with stale pfn/khva
> values. The KVM Xen use case explicitly allows/wants multiple users;
> even though the caches are allocated per vCPU, __kvm_xen_has_interrupt()
> can read a different vCPU (or vCPUs). Address this race by invalidating
> the cache prior to dropping gpc->lock (this is made possible by fixing
> the above mmu_notifier race).
>
> Finally, the refresh logic doesn't protect against concurrent refreshes
> with different GPAs (which may or may not be a desired use case, but its
> allowed in the code), nor does it protect against a false negative on the
> memslot generation. If the first refresh sees a stale memslot generation,
> it will refresh the hva and generation before moving on to the hva=>pfn
> translation. If it then drops gpc->lock, a different user can come along,
> acquire gpc->lock, see that the memslot generation is fresh, and skip
> the hva=>pfn update due to the userspace address also matching (because
> it too was updated). Address this race by adding an "in-progress" flag
> so that the refresh that acquires gpc->lock first runs to completion
> before other users can start their refresh.
>
> Complicating all of this is the fact that both the hva=>pfn resolution
> and mapping of the kernel address can sleep, i.e. must be done outside
> of gpc->lock
>
> Fix the above races in one fell swoop, trying to fix each individual race
> in a sane manner is impossible, for all intents and purposes.

Hm, the problem with this (commit 58cd407ca4c6278) is that it ends up acting like a *really* unfair read/write lock.
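That commit keys the retry loop off a check which looks roughly like this (paraphrased from pfncache.c from memory, so the details may be off; the important part is that *any* active invalidation, for *any* HVA range, forces another trip round the loop):

static bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_seq)
{
        /*
         * Any in-progress invalidation, of *any* HVA range, forces a
         * retry, because the GPC invalidation runs before the notifier
         * count and range are updated.
         */
        if (kvm->mn_active_invalidate_count)
                return true;

        /* Ensure the count is read before the sequence count. */
        smp_rmb();
        return kvm->mmu_invalidate_seq != mmu_seq;
}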
If there are a lot of 'writers' invalidating other HVA ranges, then the hva_to_pfn_retry() function as the 'reader' will back off for ever and never make progress. Even while just a single other invalidation is active (->mn_active_invalidate_count is non-zero), hva_to_pfn_retry() will just spin in an elaborate busy-wait loop, mapping the same page over and over again.

(In the repro case I'm not entirely sure *why* there are userspace threads bashing on MADV_DONTNEED and causing this, but 'Don't Do That Then' is only a partial answer. The kernel's own locking should allow it to make progress regardless.)

If we call the GPC invalidate function from invalidate_range_end() instead of _start, can't we do it concurrently without having to check for active invalidations or mmu_invalidate_seq? We can then introduce a 'validating' flag, set before the call to hva_to_pfn() in the loop, so that the translation can be shot down before it's even made. And use *that* as the trigger for the retry loop, so that it only has to retry if its *own* uHVA is messed with.

Patch follows...
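Roughly the shape I have in mind, as a hand-wavy sketch rather than the patch itself. The 'validating' field and the gpc_invalidate() helper are placeholder names, and the khva mapping, memslot handling and error paths are all elided:

static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
{
        kvm_pfn_t new_pfn;

        lockdep_assert_held_write(&gpc->lock);

        for (;;) {
                /*
                 * Mark the translation as in flight *before* dropping the
                 * lock, so that an invalidation of this uHVA can shoot it
                 * down even before hva_to_pfn() has returned.
                 */
                gpc->validating = true;

                write_unlock_irq(&gpc->lock);

                /* May sleep; same arguments as the existing call. */
                new_pfn = hva_to_pfn(gpc->uhva, ...);

                write_lock_irq(&gpc->lock);

                /*
                 * Only an invalidation which hit *this* cache's uHVA will
                 * have cleared ->validating; unrelated invalidations no
                 * longer force a retry, and there's no need to look at
                 * mn_active_invalidate_count or mmu_invalidate_seq at all.
                 */
                if (gpc->validating)
                        return new_pfn;

                /* Release/unmap the now-stale pfn before retrying (elided). */
        }
}

/* Called for each active cache, now from the invalidate_range_end() path. */
static void gpc_invalidate(struct gfn_to_pfn_cache *gpc,
                           unsigned long start, unsigned long end)
{
        write_lock_irq(&gpc->lock);
        if (gpc->uhva >= start && gpc->uhva < end) {
                gpc->valid = false;
                gpc->validating = false;        /* trigger the retry loop */
        }
        write_unlock_irq(&gpc->lock);
}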