Re: [PATCH v2] KVM: Move gfn_to_pfn_cache invalidation to invalidate_range_end hook

David Woodhouse <dwmw2@xxxxxxxxxxxxx> · Wed, 21 Aug 2024 07:57:11 +0100

On Tue, 2024-08-20 at 14:44 -0700, Sean Christopherson wrote:
> On Tue, Aug 20, 2024, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@xxxxxxxxxxxx>
> > 
> > The existing retry loop in hva_to_pfn_retry() is extremely pessimistic.
> > If there are any concurrent invalidations running, it's effectively just
> > a complex busy wait loop because its local mmu_notifier_retry_cache()
> > function will always return true.
> > 
> > Since multiple invalidations can be running in parallel, this can result
> > in a situation where hva_to_pfn_retry() just backs off and keep retrying
> > for ever, not making any progress.
> > 
> > Solve this by being a bit more selective about when to retry.
> > 
> > Introduce a separate 'needs invalidation' flag to the GPC, which allows
> > it to be marked invalid even while hva_to_pfn_retry() has dropped the
> > lock to map the newly-found PFN. This allows the invalidation to moved
> > to the range_end hook, and the retry loop only occurs for a given GPC if
> > its particular uHVA is affected.
> > 
> > However, the contract for invalidate_range_{start,end} is not like a
> > simple TLB; the pages may have been freed by the time the end hook is
> > called. A "device" may not create new mappings after the _start_ hook is
> > called. To meet this requirement, hva_to_pfn_retry() now waits until no
> > invalidations are currently running which may affect its uHVA, before
> > finally setting the ->valid flag and returning.
> 
> Please split this into 3 patches:
> 
>  1. Add range-based GPC retry
>  2. Add the wait mechanism.
>  3. Add the needs_invalidation logic.
> 
> #1 and #2 make sense to me, but I'm struggling to understanding why #3 is needed.
> KVM absolutely must not touch the memory after .invalidate_range_start(), so I
> don't see what is gained by deferring invalidation to invalidate_range_end().

KVM absolutely must not touch the memory after .invalidate_range_start().

Let us define "touch the memory" as kmapping it, setting gpc->khva,
setting the gpc->valid flag and dropping the write lock on gpc->lock.

Until those things happen, no user of the GPC is going to dereference
gpc->khva and *actually* touch the memory.

So, how do we ensure that hva_to_pfn_retry() never sets the ->valid
flag on a GPC which might be in the process of being invalidated (i.e.
.invalidate_range_start() has been called already, but not yet
.invalidate_range_end())?

If hva_to_pfn_retry() is called between the start/end invalidation
callbacks, the range-based tracking lets it know that there is a
pending invalidation which *might* affect the GPC it's working on, as
gpc_invalidations_pending() returns true. So what happens then? ...

If there is a pending invalidation which *might* affect its GPC, that's
why hva_to_pfn_retry() now waits until any pending invalidations are
complete, then checks the ->needs_invalidation flag to see if that
pending invalidation actually *did* affect its GPC.

This is where #3 comes in. Without it, we'd have to be pessimistic and
assume that *any* time gpc_invalidations_pending() returned true, we
have to ditch the lookup and start again. We would potentially have a
*lot* more false positives in the hva_to_pfn_retry() revalidation loop.

I'm also not entirely sure before the first coffee of the morning has
taken effect, how the locking would work. Wouldn't we want to hold both
kvm->mn_invalidate_lock *and* the gpc->lock at the same time, to check
the range and set gpc->valid 'atomically'? I suppose that's possible
but I'd prefer to avoid it.

Without that atomicity, an .invalidate_range_start() call could occur
while hva_to_pfn_retry() has completed its loop and is setting the
gpc->valid flag.

Attachment:
smime.p7s

Description: S/MIME cryptographic signature