Re: [PATCH] KVM: Use precise range-based flush in mmu_notifier hooks when possible

Will Deacon <will@xxxxxxxxxx> · Tue, 20 Aug 2024 17:32:15 +0100

On Tue, Aug 20, 2024 at 09:07:22AM -0700, Sean Christopherson wrote:
> On Tue, Aug 20, 2024, Will Deacon wrote:
> > On Fri, Aug 02, 2024 at 12:16:17PM -0700, Sean Christopherson wrote:
> > > Do arch-specific range-based TLB flushes (if they're supported) when
> > > flushing in response to mmu_notifier events, as a single range-based flush
> > > is almost always more performant.  This is especially true in the case of
> > > mmu_notifier events, as the majority of events that hit a running VM
> > > operate on a relatively small range of memory.
> > > 
> > > Cc: Marc Zyngier <maz@xxxxxxxxxx>
> > > Cc: Will Deacon <will@xxxxxxxxxx>
> > > Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> > > ---
> > > 
> > > This is *very* lightly tested, a thumbs up from the ARM world would be much
> > > appreciated.
> > > 
> > >  virt/kvm/kvm_main.c | 15 ++++++++++++++-
> > >  1 file changed, 14 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index d0788d0a72cc..46bb95d58d53 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -599,6 +599,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> > >  	struct kvm_gfn_range gfn_range;
> > >  	struct kvm_memory_slot *slot;
> > >  	struct kvm_memslots *slots;
> > > +	bool need_flush = false;
> > >  	int i, idx;
> > >  
> > >  	if (WARN_ON_ONCE(range->end <= range->start))
> > > @@ -651,10 +652,22 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> > >  					goto mmu_unlock;
> > >  			}
> > >  			r.ret |= range->handler(kvm, &gfn_range);
> > > +
> > > +			/*
> > > +			 * Use a precise gfn-based TLB flush when possible, as
> > > +			 * most mmu_notifier events affect a small-ish range.
> > > +			 * Fall back to a full TLB flush if the gfn-based flush
> > > +			 * fails, and don't bother trying the gfn-based flush
> > > +			 * if a full flush is already pending.
> > > +			 */
> > > +			if (range->flush_on_ret && !need_flush && r.ret &&
> > > +			    kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start,
> > > +							     gfn_range.end - gfn_range.start))
> > > +				need_flush = true;
> > 
> > Thanks for having a crack at this.
> > 
> > We could still do better in the ->clear_flush_young() case if the
> 
> For clear_flush_young(), I 100% think we should let architectures opt out of the
> flush.  For architectures where it's safe, the primary MMU doesn't do a TLB flush,
> and hasn't for years.  Sending patches for this (for at least x86 and arm64) is
> on my todo list.

I can see the appeal of dropping the invalidation altogether, although
with the zoo of different micro-architectures we have on arm64 I do
worry that it could potentially make the AF information a lot useful on
some parts. Does x86 make any guarantees about when an old pte becomes
visible to the CPU in the absence of explicit TLB invalidation? (e.g.
I'm wondering if it's bounded by the next context switch or something
like that).

> Even better would be to kill off mmu_notifier_clear_flush_young() entirely, e.g.
> if all KVM architectures can elide the flush.

I, for one, would love to see fewer MMU notifiers :)

> And even better than that would be to kill pxxx_clear_flush_young_notify() in
> the kernel, but I suspect that's not feasible as there are architectures that
> require a TLB flush for correctness.

I think you might want it for IOMMUs as well.

> > handler could do the invalidation as part of its page-table walk (for
> > example, it could use information about the page-table structure such
> > as the level of the leaves to optimise the invalidation further), but
> > this does at least avoid zapping the whole VMID on CPUs with range
> > support.
> > 
> > My only slight concern is that, should clear_flush_young() be extended
> > to operate on more than a single page-at-a-time in future, this will
> > silently end up invalidating the entire VMID for each memslot unless we
> > teach kvm_arch_flush_remote_tlbs_range() to return !0 in that case.
> 
> I'm not sure I follow the "entire VMID for each memslot" concern.  Are you
> worried about kvm_arch_flush_remote_tlbs_range() failing and triggering a VM-wide
> flush?

The arm64 implementation of kvm_arch_flush_remote_tlbs_range()
unconditionally returns 0, so we could end up over-invalidating pretty
badly if that doesn't change. It should be straightforward to fix, but
I just wanted to point it out because it would be easy to miss too!

Will