Re: [PATCH] KVM: Use precise range-based flush in mmu_notifier hooks when possible

Sean Christopherson <seanjc@xxxxxxxxxx> · Tue, 20 Aug 2024 10:06:00 -0700

On Tue, Aug 20, 2024, Will Deacon wrote:
> On Tue, Aug 20, 2024 at 09:07:22AM -0700, Sean Christopherson wrote:
> > On Tue, Aug 20, 2024, Will Deacon wrote:
> > > On Fri, Aug 02, 2024 at 12:16:17PM -0700, Sean Christopherson wrote:
> > > > Do arch-specific range-based TLB flushes (if they're supported) when
> > > > flushing in response to mmu_notifier events, as a single range-based flush
> > > > is almost always more performant.  This is especially true in the case of
> > > > mmu_notifier events, as the majority of events that hit a running VM
> > > > operate on a relatively small range of memory.
> > > > 
> > > > Cc: Marc Zyngier <maz@xxxxxxxxxx>
> > > > Cc: Will Deacon <will@xxxxxxxxxx>
> > > > Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> > > > ---
> > > > 
> > > > This is *very* lightly tested, a thumbs up from the ARM world would be much
> > > > appreciated.
> > > > 
> > > >  virt/kvm/kvm_main.c | 15 ++++++++++++++-
> > > >  1 file changed, 14 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > index d0788d0a72cc..46bb95d58d53 100644
> > > > --- a/virt/kvm/kvm_main.c
> > > > +++ b/virt/kvm/kvm_main.c
> > > > @@ -599,6 +599,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> > > >  	struct kvm_gfn_range gfn_range;
> > > >  	struct kvm_memory_slot *slot;
> > > >  	struct kvm_memslots *slots;
> > > > +	bool need_flush = false;
> > > >  	int i, idx;
> > > >  
> > > >  	if (WARN_ON_ONCE(range->end <= range->start))
> > > > @@ -651,10 +652,22 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> > > >  					goto mmu_unlock;
> > > >  			}
> > > >  			r.ret |= range->handler(kvm, &gfn_range);
> > > > +
> > > > +			/*
> > > > +			 * Use a precise gfn-based TLB flush when possible, as
> > > > +			 * most mmu_notifier events affect a small-ish range.
> > > > +			 * Fall back to a full TLB flush if the gfn-based flush
> > > > +			 * fails, and don't bother trying the gfn-based flush
> > > > +			 * if a full flush is already pending.
> > > > +			 */
> > > > +			if (range->flush_on_ret && !need_flush && r.ret &&
> > > > +			    kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start,
> > > > +							     gfn_range.end - gfn_range.start))
> > > > +				need_flush = true;
> > > 
> > > Thanks for having a crack at this.
> > > 
> > > We could still do better in the ->clear_flush_young() case if the
> > 
> > For clear_flush_young(), I 100% think we should let architectures opt out of the
> > flush.  For architectures where it's safe, the primary MMU doesn't do a TLB flush,
> > and hasn't for years.  Sending patches for this (for at least x86 and arm64) is
> > on my todo list.
> 
> I can see the appeal of dropping the invalidation altogether, although
> with the zoo of different micro-architectures we have on arm64 I do
> worry that it could potentially make the AF information a lot useful on
> some parts. Does x86 make any guarantees about when an old pte becomes
> visible to the CPU in the absence of explicit TLB invalidation? (e.g.
> I'm wondering if it's bounded by the next context switch or something
> like that).

Nope.  As pointed out by Yu in the MGLRU series[*], the main argument is that if
there is memory pressure, i.e. a need for reclaim, then it's all but guaranteed
that there will be even more TLB pressure.  And while false negatives are certainly
possible, nr_scanned >> nr_reclaimed >> nr_bad_reclaims, i.e. the total cost of
false negatives is less than the total cost of the flushes because there are
orders of magnitude more pages scanned than there are pages that are incorrectly
reclaimed.

https://lore.kernel.org/all/CAOUHufYCmYNngmS=rOSAQRB0N9ai+mA0aDrB9RopBvPHEK42Ng@xxxxxxxxxxxxxx

> 
> > Even better would be to kill off mmu_notifier_clear_flush_young() entirely, e.g.
> > if all KVM architectures can elide the flush.
> 
> I, for one, would love to see fewer MMU notifiers :)
> 
> > And even better than that would be to kill pxxx_clear_flush_young_notify() in
> > the kernel, but I suspect that's not feasible as there are architectures that
> > require a TLB flush for correctness.
> 
> I think you might want it for IOMMUs as well.
> 
> > > handler could do the invalidation as part of its page-table walk (for
> > > example, it could use information about the page-table structure such
> > > as the level of the leaves to optimise the invalidation further), but
> > > this does at least avoid zapping the whole VMID on CPUs with range
> > > support.
> > > 
> > > My only slight concern is that, should clear_flush_young() be extended
> > > to operate on more than a single page-at-a-time in future, this will
> > > silently end up invalidating the entire VMID for each memslot unless we
> > > teach kvm_arch_flush_remote_tlbs_range() to return !0 in that case.
> > 
> > I'm not sure I follow the "entire VMID for each memslot" concern.  Are you
> > worried about kvm_arch_flush_remote_tlbs_range() failing and triggering a VM-wide
> > flush?
> 
> The arm64 implementation of kvm_arch_flush_remote_tlbs_range()
> unconditionally returns 0, so we could end up over-invalidating pretty
> badly if that doesn't change. It should be straightforward to fix, but
> I just wanted to point it out because it would be easy to miss too!

Sorry, I'm still not following.  0==success, and gfn_range.{start,end} is scoped
precisely to the overlap between the memslot and hva range.  Regardless of the
number of pages that are passed into clear_flush_young(), KVM should naturally
flush only the exact range being aged.  The only hiccup would be if the hva range
straddles multiple memslots, but if userspace creates multiple memslots for a
single vma, then that's a userspace problem.