Re: [PATCH] KVM: Move gfn_to_pfn_cache invalidation to invalidate_range_end hook

On Tue, Aug 06, 2024, David Woodhouse wrote:
> On Tue, 2024-08-06 at 07:03 -0700, Sean Christopherson wrote:
> > On Tue, Aug 06, 2024, David Woodhouse wrote:
> > > On Mon, 2024-08-05 at 17:45 -0700, Sean Christopherson wrote:
> > > > On Mon, Aug 05, 2024, David Woodhouse wrote:
> > > > > From: David Woodhouse <dwmw@xxxxxxxxxxxx>
> > > > Servicing guest pages faults has the same problem, which is why
> > > > mmu_invalidate_retry_gfn() was added.  Supporting hva-only GPCs made our lives a
> > > > little harder, but not horrifically so (there are ordering differences regardless).
> > > > 
> > > > Woefully incomplete, but I think this is the gist of what you want:
> > > 
> > > Hm, maybe. It does mean that migrations occurring throughout memory
> > > (indeed, even just one at the top and one at the bottom of guest
> > > memory space) would perturb GPCs which remain present.
> > 
> > If that happens with a real world VMM, and it's not a blatant VMM goof, then we
> > can fix KVM.  The stage-2 page fault path hammers the mmu_notifier retry logic
> > far more than GPCs, so if a range-based check is inadequate for some use case,
> > then we definitely need to fix both.
> > 
> > In short, I don't see any reason to invent something different for GPCs.
> > 
> > > > > @@ -849,6 +837,8 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > > > >         wake = !kvm->mn_active_invalidate_count;
> > > > >         spin_unlock(&kvm->mn_invalidate_lock);
> > > > >  
> > > > > +       gfn_to_pfn_cache_invalidate(kvm, range->start, range->end);
> > > > 
> > > > We can't do this.  The contract with mmu_notifiers is that secondary MMUs must
> > > > unmap the hva before returning from invalidate_range_start(), and must not create
> > > > new mappings until invalidate_range_end().
> 
> Looking at that assertion harder... where is that rule written?

The big comment for invalidate_range_{start,end}() in include/linux/mmu_notifier.h.
The relevant snippets are:

	 * If the subsystem can't guarantee that no additional references are
	 * taken to the pages in the range, it has to implement the
	 * invalidate_range() notifier to remove any references taken after
	 * invalidate_range_start().

	 * invalidate_range_start() is called when all pages in the
	 * range are still mapped and have at least a refcount of one.
	 *
	 * invalidate_range_end() is called when all pages in the
	 * range have been unmapped and the pages have been freed by
	 * the VM.

The last one is key: the pages have already been freed when invalidate_range_end()
is called, and so unmapping at that time would be too late.
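
To make the ordering concrete, here's a minimal sketch of what the contract
requires of a secondary MMU.  This is purely illustrative, not KVM's actual
code; the foo_* names, the locking, and foo_unmap_range() are all made up,
but the signatures match mmu_notifier_ops:

	struct foo_mmu {
		struct mmu_notifier mn;
		spinlock_t lock;
		wait_queue_head_t wq;
		int active_invalidations;
	};

	static int foo_invalidate_range_start(struct mmu_notifier *mn,
					      const struct mmu_notifier_range *range)
	{
		struct foo_mmu *foo = container_of(mn, struct foo_mmu, mn);

		spin_lock(&foo->lock);
		foo->active_invalidations++;	/* fence off new mappings */
		foo_unmap_range(foo, range->start, range->end); /* drop existing ones */
		spin_unlock(&foo->lock);
		return 0;
	}

	static void foo_invalidate_range_end(struct mmu_notifier *mn,
					     const struct mmu_notifier_range *range)
	{
		struct foo_mmu *foo = container_of(mn, struct foo_mmu, mn);

		spin_lock(&foo->lock);
		if (!--foo->active_invalidations)
			wake_up(&foo->wq);	/* creating mappings may resume */
		spin_unlock(&foo->lock);
	}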

> It seems counter-intuitive to me; that isn't how TLBs work. Another CPU can
> populate a TLB entry right up to the moment the PTE is actually *changed* in
> the page tables, and then the CPU which is modifying/zapping the PTE needs to
> perform a remote TLB flush. That remote TLB flush is analogous to the
> invalidate_range_end() call, surely?

KVM's usage isn't about (hardware) TLBs.  Ah, and the history is even somewhat
evident in the above comment I referenced.  invalidate_range() no longer exists;
it was morphed into arch_invalidate_secondary_tlbs().  For secondary MMUs that
reuse the primary MMU's PTEs, mmu_notifier_arch_invalidate_secondary_tlbs() is
indeed called after the PTEs have been modified.

KVM's usage is different.  Because KVM has its own (Secondary) PTEs (commit
1af5a8109904 ("mmu_notifiers: rename invalidate_range notifier") calls them
"software TLBs", but I find that to be a confusing description), zapping on-demand
when the primary PTEs are modified is tricky and ultimately undesirable.

E.g. invoking mmu_notifiers while holding a PTE lock would prevent KVM from
blocking, which can be problematic if KVM needs to zap a large number of SPTEs.

And doing invalidation on-demand for each primary PTE would be suboptimal for
cases where a large VMA range is unmapped/modified, e.g. KVM would get a large
number of invalidation events instead of one big, all-encompassing invalidation.

The obvious downside is what you've run into, where the start+end approach forces
KVM to wait for all in-flight invalidations to go away.  But again, in practice
the rudimentary range tracking suffices for all known use cases.
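
Roughly speaking, the range tracking amounts to this (a simplified paraphrase
of mmu_invalidate_retry_gfn(), not the verbatim upstream code; the caller
snapshots kvm->mmu_invalidate_seq before doing its work):

	bool invalidation_retry(struct kvm *kvm, unsigned long mmu_seq, gfn_t gfn)
	{
		/* An in-flight invalidation covers this gfn: must retry. */
		if (kvm->mmu_invalidate_in_progress &&
		    gfn >= kvm->mmu_invalidate_range_start &&
		    gfn < kvm->mmu_invalidate_range_end)
			return true;

		/* An invalidation completed after the snapshot was taken. */
		if (kvm->mmu_invalidate_seq != mmu_seq)
			return true;

		return false;
	}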

> I'm fairly sure that's how it works for PASID support too; nothing
> prevents the IOMMU+device from populating an IOTLB entry until the PTE
> is actually changed in the process page tables.
> 
> So why can't we do the same for the GPC?
> 
> > > But in the context of the GPC, it is only "mapped" when the ->valid bit is set. 
> > > 
> > > Even the invalidation callback just clears the valid bit, and that
> > > means nobody is allowed to dereference the ->khva any more. It doesn't
> > > matter that the underlying (stale) PFN is still kmapped.
> > > 
> > > Can we not apply the same logic to the hva_to_pfn_retry() loop? Yes, it
> > > might kmap a page that gets removed, but it's not actually created a
> > > new mapping if it hasn't set the ->valid bit.
> > > 
> > > I don't think this version quite meets the constraints, and I might
> > > need to hook *both* the start and end notifiers, and might not like it
> > > once I get there. But I'll have a go...
> > 
> > I'm pretty sure you're going to need the range-based retry logic.  KVM can't
> > safely set gpc->valid until mn_active_invalidate_count reaches zero, so if a GPC
> > refresh comes along after mn_active_invalidate_count has been elevated, it won't
> > be able to set gpc->valid until the MADV_DONTNEED storm goes away.  Without
> > range-based tracking, there's no way to know if a previous invalidation was
> > relevant to the GPC.
> 
> If it is indeed the case that KVM can't just behave like a normal TLB,
> and so can't set gpc->valid until mn_active_invalidate_count reaches
> zero, it still only needs to *wait* (or spin, maybe). It certainly
> doesn't need to keep looping and remapping the same PFN over and over
> again, as it does at the moment.
> 
> When mn_active_invalidate_count does reach zero, either the young GPC
> will have been invalidated by clearing the (to be renamed) ->validating
> flag, or it won't have been. If it *has* been invalidated, that's when
> hva_to_pfn_retry() needs to go one more time round its full loop.
> 
> So it just needs to wait until any pending (relevant) invalidations
> have completed, *then* check and potentially loop once more.
> 
> And yes, making that *wait* range-based does make some sense, I
> suppose. It becomes "wait for gpc->uhva not to be within the range of
> kvm->mmu_gpc_invalidate_range_{start,end}."

Yep, exactly.  Without range-based tracking, there's no way for KVM to know when
a relevant in-flight invalidation has completed.
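
I.e. something like this, though it's purely hypothetical; the
mmu_gpc_invalidate_range_{start,end} fields are the ones you proposed above
and don't exist today:

	static void gpc_wait_for_invalidations(struct kvm *kvm, unsigned long uhva)
	{
		spin_lock(&kvm->mn_invalidate_lock);
		while (kvm->mn_active_invalidate_count &&
		       uhva >= kvm->mmu_gpc_invalidate_range_start &&
		       uhva < kvm->mmu_gpc_invalidate_range_end) {
			spin_unlock(&kvm->mn_invalidate_lock);
			cond_resched();
			spin_lock(&kvm->mn_invalidate_lock);
		}
		spin_unlock(&kvm->mn_invalidate_lock);
	}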

> Except... that range can never shrink *except* when
> mn_active_invalidate_count becomes zero, can it?

Not without more sophisticated logic, no.  E.g. if KVM supported tracking multiple
distinct ranges, then individual invalidation ranges could be dropped.  But to
avoid memory allocations in invalidate_range_start(), KVM would still need to
hardcode the maximum number of in-flight ranges.  E.g. even if KVM used a dynamic
container, we'd probably want the container entries to be "allocated" out of a
cache, and that cache would need a maximum capacity.

With a max limit on the number of ranges, KVM would still be forced to combine
ranges if there are too many in-flight invalidations.
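
E.g. the fallback would look something like this (names and the fixed-size
table are illustrative only, nothing like this exists in KVM):

	#define MAX_INFLIGHT_RANGES 8

	struct inval_ranges {
		int nr;
		struct { unsigned long start, end; } r[MAX_INFLIGHT_RANGES];
	};

	static void track_range(struct inval_ranges *t,
				unsigned long start, unsigned long end)
	{
		if (t->nr == MAX_INFLIGHT_RANGES) {
			/* No free slot: widen slot 0 into one big range. */
			t->r[0].start = min(t->r[0].start, start);
			t->r[0].end   = max(t->r[0].end, end);
		} else {
			t->r[t->nr].start = start;
			t->r[t->nr].end   = end;
			t->nr++;
		}
	}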

So, because tracking a single range has sufficed for all known use cases, and
it's significantly simpler than tracking multiple ranges, AFAIK no one has pursued
a multi-range tracking implementation.

> So if we do end up waiting, the wake condition is *still* just that the count
> has become zero. There's already a wakeup in that case, on
> kvm->mn_memslots_update_rcuwait. Can I wait on that?

I suspect you're trying to solve a problem that doesn't exist in practice.
hva_to_pfn_retry() already has a cond_resched(), so getting stuck for a long
duration isn't fatal, just suboptimal.  And similar to the range-based tracking,
_if_ there's a problem in practice, then it also affects guest page faults.  KVM
simply resumes the vCPU and keeps re-faulting until the in-flight invalidations
have gone away.

Not without reworking mn_memslots_update_rcuwait.  KVM assumes there is at most
one waiter, as that wait+wake combination is specifically to handle the case where
a _relevant_ in-flight mmu_notifier invalidation needs to block a userspace memslot
deletion.  KVM takes mmu_lock in invalidate_range_{start,end}() if and only if
there is an overlapping memslot, and so KVM needs to prevent a memslot from being
deleted between start() and end().
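
For context, the existing pairing is roughly this (paraphrased, the actual
code in kvm_main.c open-codes the wait loop under mn_invalidate_lock; rcuwait
tracks exactly one task, hence the single-waiter assumption):

	/* Waiter side, in the memslot deletion path: */
	rcuwait_wait_event(&kvm->mn_memslots_update_rcuwait,
			   !READ_ONCE(kvm->mn_active_invalidate_count),
			   TASK_UNINTERRUPTIBLE);

	/* Waker side, at the tail of invalidate_range_end(): */
	if (wake)
		rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);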
