On Tue, 2024-08-20 at 14:44 -0700, Sean Christopherson wrote: > On Tue, Aug 20, 2024, David Woodhouse wrote: > > From: David Woodhouse <dwmw@xxxxxxxxxxxx> > > > > The existing retry loop in hva_to_pfn_retry() is extremely pessimistic. > > If there are any concurrent invalidations running, it's effectively just > > a complex busy wait loop because its local mmu_notifier_retry_cache() > > function will always return true. > > > > Since multiple invalidations can be running in parallel, this can result > > in a situation where hva_to_pfn_retry() just backs off and keep retrying > > for ever, not making any progress. > > > > Solve this by being a bit more selective about when to retry. > > > > Introduce a separate 'needs invalidation' flag to the GPC, which allows > > it to be marked invalid even while hva_to_pfn_retry() has dropped the > > lock to map the newly-found PFN. This allows the invalidation to moved > > to the range_end hook, and the retry loop only occurs for a given GPC if > > its particular uHVA is affected. > > > > However, the contract for invalidate_range_{start,end} is not like a > > simple TLB; the pages may have been freed by the time the end hook is > > called. A "device" may not create new mappings after the _start_ hook is > > called. To meet this requirement, hva_to_pfn_retry() now waits until no > > invalidations are currently running which may affect its uHVA, before > > finally setting the ->valid flag and returning. > > Please split this into 3 patches: > > 1. Add range-based GPC retry > 2. Add the wait mechanism. > 3. Add the needs_invalidation logic. > > #1 and #2 make sense to me, but I'm struggling to understanding why #3 is needed. > KVM absolutely must not touch the memory after .invalidate_range_start(), so I > don't see what is gained by deferring invalidation to invalidate_range_end(). KVM absolutely must not touch the memory after .invalidate_range_start(). Let us define "touch the memory" as kmapping it, setting gpc->khva, setting the gpc->valid flag and dropping the write lock on gpc->lock. Until those things happen, no user of the GPC is going to dereference gpc->khva and *actually* touch the memory. So, how do we ensure that hva_to_pfn_retry() never sets the ->valid flag on a GPC which might be in the process of being invalidated (i.e. .invalidate_range_start() has been called already, but not yet .invalidate_range_end())? If hva_to_pfn_retry() is called between the start/end invalidation callbacks, the range-based tracking lets it know that there is a pending invalidation which *might* affect the GPC it's working on, as gpc_invalidations_pending() returns true. So what happens then? ... If there is a pending invalidation which *might* affect its GPC, that's why hva_to_pfn_retry() now waits until any pending invalidations are complete, then checks the ->needs_invalidation flag to see if that pending invalidation actually *did* affect its GPC. This is where #3 comes in. Without it, we'd have to be pessimistic and assume that *any* time gpc_invalidations_pending() returned true, we have to ditch the lookup and start again. We would potentially have a *lot* more false positives in the hva_to_pfn_retry() revalidation loop. I'm also not entirely sure before the first coffee of the morning has taken effect, how the locking would work. Wouldn't we want to hold both kvm->mn_invalidate_lock *and* the gpc->lock at the same time, to check the range and set gpc->valid 'atomically'? I suppose that's possible but I'd prefer to avoid it. Without that atomicity, an .invalidate_range_start() call could occur while hva_to_pfn_retry() has completed its loop and is setting the gpc->valid flag.
Attachment:
smime.p7s
Description: S/MIME cryptographic signature