On Wed, 2022-04-27 at 01:40 +0000, Sean Christopherson wrote:
> Rework the gfn=>pfn cache (gpc) refresh logic to address multiple races
> between the cache itself, and between the cache and mmu_notifier events.
>
> The existing refresh code attempts to guard against races with the
> mmu_notifier by speculatively marking the cache valid, and then marking
> it invalid if a mmu_notifier invalidation occurs. That handles the case
> where an invalidation occurs between dropping and re-acquiring gpc->lock,
> but it doesn't handle the scenario where the cache is refreshed after the
> cache was invalidated by the notifier, but before the notifier elevates
> mmu_notifier_count. The gpc refresh can't use the "retry" helper as its
> invalidation occurs _before_ mmu_notifier_count is elevated and before
> mmu_notifier_range_start is set/updated.
>
>                   CPU0                                    CPU1
>                   ----                                    ----
>
>   gfn_to_pfn_cache_invalidate_start()
>   |
>   -> gpc->valid = false;
>                                           kvm_gfn_to_pfn_cache_refresh()
>                                           |
>                                           |-> gpc->valid = true;
>
>                                           hva_to_pfn_retry()
>                                           |
>                                           -> acquire kvm->mmu_lock
>                                              kvm->mmu_notifier_count == 0
>                                              mmu_seq == kvm->mmu_notifier_seq
>                                              drop kvm->mmu_lock
>                                              return pfn 'X'
>   acquire kvm->mmu_lock
>   kvm_inc_notifier_count()
>   drop kvm->mmu_lock()
>   kernel frees pfn 'X'
>                                           kvm_gfn_to_pfn_cache_check()
>                                           |
>                                           |-> gpc->valid == true
>
>                                           caller accesses freed pfn 'X'
>
> Key off of mn_active_invalidate_count to detect that a pfncache refresh
> needs to wait for an in-progress mmu_notifier invalidation. While
> mn_active_invalidate_count is not guaranteed to be stable, it is
> guaranteed to be elevated prior to an invalidation acquiring gpc->lock,
> so either the refresh will see an active invalidation and wait, or the
> invalidation will run after the refresh completes.
>
> Speculatively marking the cache valid is itself flawed, as a concurrent
> kvm_gfn_to_pfn_cache_check() would see a valid cache with stale pfn/khva
> values. The KVM Xen use case explicitly allows/wants multiple users;
> even though the caches are allocated per vCPU, __kvm_xen_has_interrupt()
> can read a different vCPU (or vCPUs). Address this race by invalidating
> the cache prior to dropping gpc->lock (this is made possible by fixing
> the above mmu_notifier race).
>
> Finally, the refresh logic doesn't protect against concurrent refreshes
> with different GPAs (which may or may not be a desired use case, but its
> allowed in the code), nor does it protect against a false negative on the
> memslot generation. If the first refresh sees a stale memslot generation,
> it will refresh the hva and generation before moving on to the hva=>pfn
> translation. If it then drops gpc->lock, a different user can come along,
> acquire gpc->lock, see that the memslot generation is fresh, and skip
> the hva=>pfn update due to the userspace address also matching (because
> it too was updated). Address this race by adding an "in-progress" flag
> so that the refresh that acquires gpc->lock first runs to completion
> before other users can start their refresh.
>
> Complicating all of this is the fact that both the hva=>pfn resolution
> and mapping of the kernel address can sleep, i.e. must be done outside
> of gpc->lock
>
> Fix the above races in one fell swoop, trying to fix each individual race
> in a sane manner is impossible, for all intents and purposes.

Hm, the problem with this (commit 58cd407ca4c6278) is that it ends up acting like a *really* unfair read/write lock.
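That commit keys the retry loop off a check which looks roughly like this (paraphrased from pfncache.c from memory, so the details may be off; the important part is that *any* active invalidation, for *any* HVA range, forces another trip round the loop):

static bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_seq)
{
        /*
         * Any in-progress invalidation, of *any* HVA range, forces a
         * retry, because the GPC invalidation runs before the notifier
         * count and range are updated.
         */
        if (kvm->mn_active_invalidate_count)
                return true;

        /* Ensure the count is read before the sequence count. */
        smp_rmb();
        return kvm->mmu_invalidate_seq != mmu_seq;
}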
If there are a lot of 'writers' invalidating other HVA ranges, then the hva_to_pfn_retry() function as the 'reader' will back off for ever and never make progress. Even while just a single other invalidation is active (->mn_active_invalidate_count is non-zero), hva_to_pfn_retry() will just spin in an elaborate busy-wait loop, mapping the same page over and over again.

(In the repro case I'm not entirely sure *why* there are userspace threads bashing on MADV_DONTNEED and causing this, but 'Don't Do That Then' is only a partial answer. The kernel's own locking should allow it to make progress regardless.)

If we call the GPC invalidate function from invalidate_range_end() instead of _start, can't we do it concurrently without having to check for active invalidations or mmu_invalidate_seq? We can then introduce a 'validating' flag, set before the call to hva_to_pfn() in the loop, so that the translation can be shot down before it's even made. And use *that* as the trigger for the retry loop, so that it only has to retry if its *own* uHVA is messed with.

Patch follows...
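Roughly the shape I have in mind, as a hand-wavy sketch rather than the patch itself. The 'validating' field and the gpc_invalidate() helper are placeholder names, and the khva mapping, memslot handling and error paths are all elided:

static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
{
        kvm_pfn_t new_pfn;

        lockdep_assert_held_write(&gpc->lock);

        for (;;) {
                /*
                 * Mark the translation as in flight *before* dropping the
                 * lock, so that an invalidation of this uHVA can shoot it
                 * down even before hva_to_pfn() has returned.
                 */
                gpc->validating = true;

                write_unlock_irq(&gpc->lock);

                /* May sleep; same arguments as the existing call. */
                new_pfn = hva_to_pfn(gpc->uhva, ...);

                write_lock_irq(&gpc->lock);

                /*
                 * Only an invalidation which hit *this* cache's uHVA will
                 * have cleared ->validating; unrelated invalidations no
                 * longer force a retry, and there's no need to look at
                 * mn_active_invalidate_count or mmu_invalidate_seq at all.
                 */
                if (gpc->validating)
                        return new_pfn;

                /* Release/unmap the now-stale pfn before retrying (elided). */
        }
}

/* Called for each active cache, now from the invalidate_range_end() path. */
static void gpc_invalidate(struct gfn_to_pfn_cache *gpc,
                           unsigned long start, unsigned long end)
{
        write_lock_irq(&gpc->lock);
        if (gpc->uhva >= start && gpc->uhva < end) {
                gpc->valid = false;
                gpc->validating = false;        /* trigger the retry loop */
        }
        write_unlock_irq(&gpc->lock);
}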