On Sat, Apr 18, 2009 at 05:34:27PM +0200, Andrea Arcangeli wrote:
> On Sun, Apr 12, 2009 at 07:31:58PM -0300, Marcelo Tosatti wrote:
> > mmu_sync_children needs to find any _reachable_ sptes that are
> > unsync, read the guest pagetable, and sync the sptes. Before it can
> > inspect the guest pagetable, it needs to write protect it, with
> > rmap_write_protect.
>
> So far so good.
>
> > In theory it wouldn't be necessary to call
> > kvm_flush_remote_tlbs_cond(protected) here, but only
> > kvm_flush_remote_tlbs(), since the "kvm->remote_tlbs_dirty"
> > information is not pertinent to mmu_sync_children.
>
> Hmm, I'm not sure I fully understand how it is not pertinent.
>
> When we have a cr3 write in a remote vcpu, mmu_sync_children runs and
> resyncs all sptes reachable from that remote vcpu context. But to
> resync the sptes, it also has to get rid of any old writable tlb
> entry for the remote vcpu where the cr3 write is running. Checking
> only the sptes to find writable ones, updating those sptes, and
> marking them wrprotected, isn't enough, as old spte contents may
> still be cached in the tlb if remote_tlbs_dirty is true!
>
> Think of the case where the invlpg'd spte that got nuked by
> rmap_remove in the invlpg handler running on the current vcpu has a
> writable tlb entry active in the vcpu that later does the cr3
> overwrite. mmu_sync_children running on the remote vcpu will find no
> writable spte in the rmap chain representing that spte (because the
> spte still cached in the remote tlb has already been zapped by the
> current vcpu), but it is still cached and writable in the remote
> vcpu's TLB when the cr3 overwrite runs.

Right, so you have to cope with the fact that invlpg can skip a TLB
flush. OK.

> > But this is done here to "clear" remote_tlbs_dirty (after a
> > kvm_flush_remote_tlbs, remote_tlbs_dirty is clean), ie: as an
> > optimization.
>
> The whole point of remote_tlbs_dirty is to defer any smp tlb flush to
> the last possible moment, so how can it be an optimization to run a
> kvm_flush_remote_tlbs earlier than necessary? The only way this can
> be an optimization is to never run kvm_flush_remote_tlbs unless
> absolutely necessary, and to leave remote_tlbs_dirty true instead of
> calling kvm_flush_remote_tlbs. Calling kvm_flush_remote_tlbs_cond
> instead of kvm_flush_remote_tlbs cannot ever be an optimization.

Right.

> > > >> @@ -465,7 +464,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu,
> > > >> gva_t gva)
> > > >> 			rmap_remove(vcpu->kvm, sptep);
> > > >> 			if (is_large_pte(*sptep))
> > > >> 				--vcpu->kvm->stat.lpages;
> > > >> -			need_flush = 1;
> > > >> +			vcpu->kvm->remote_tlbs_dirty = true;
> > > >> 		}
> > > >> 		set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
> > > >> 		break;
> > > >> @@ -475,8 +474,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t
> > > >> gva)
> > > >> 		break;
> > > >> 	}
> > > >> -	if (need_flush)
> > > >> -		kvm_flush_remote_tlbs(vcpu->kvm);
> > > >> 	spin_unlock(&vcpu->kvm->mmu_lock);
> > >
> > > AFAIK, to be compliant with lowlevel archs (without ASN it
> > > doesn't matter I think, as vmx always flushes on exit), we have
> > > to flush the local tlb here, with set_bit(KVM_REQ_TLB_FLUSH,
> > > &vcpu->requests). I don't see why it's missing. Or am I wrong?
> >
> > Caller does it:
> >
> > void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
> > {
> > 	vcpu->arch.mmu.invlpg(vcpu, gva);
> > 	kvm_mmu_flush_tlb(vcpu);
> > 	++vcpu->stat.invlpg;
> > }
>
> Ah great! Avi also mentioned it, I recall, but I didn't figure out
> that it ran after FNAME(invlpg) returns. But isn't it always more
> efficient to set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests) instead,
> like I'm doing?

Sure, that works too.

> See my version of kvm_flush_local_tlb; it's a bit different from
> kvm_mmu_flush_tlb, and I made the old no-mmu-notifier kernel safe
> too, which I think is worth it.
> If you'd like to keep my version of kvm_flush_local_tlb, then I
> simply have to remove the kvm_mmu_flush_tlb from kvm_mmu_invlpg and
> move it inside the arch.mmu.invlpg handler, so each shadow
> implementation does it its own way. Comments?

I'm fine with your kvm_flush_local_tlb. Just one minor nit:

+	/* get new asid before returning to guest mode */
+	if (!test_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests))
+		set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests);

What's the test_bit for?

> If you disagree, and you want to run kvm_mmu_flush_tlb in
> kvm_mmu_invlpg instead of set_bit(KVM_REQ_TLB_FLUSH,
> &vcpu->requests), then I can expand the kvm_flush_local_tlb with the
> smp_wmb() in the FNAME(invlpg) code. Like:
>
> 	if (need_flush) {
> 		smp_wmb();
> 		remote_tlbs_dirty = true;
> 	}
> 	spin_unlock(mmu_lock);
>
> Then the local tlb flush will run when we return from FNAME(invlpg),
> and remote_tlbs_dirty is set _after_ set_shadow_pte and _before_
> releasing mmu_lock, making it still safe against
> mmu_notifier_invalidate_page/range.
>
> > What about protecting remote_tlbs_dirty with mmu_lock? The only
> > caller of kvm_flush_remote_tlbs that lacks mmu_notifier is
> > kvm_mmu_zap_all, which is not performance sensitive anyway.
>
> I thought of it too, I have to say. I tried it a bit too, then I
> figured out why Avi wanted to do it out of order. The reason is that
> we want to allow kvm_flush_remote_tlbs to run outside the mmu_lock
> too. So this actually results in more self-containment of the
> remote_tlbs_dirty logic and simpler code. But at least I documented
> what the smp_wmb is serializing and how it works.

OK.

> > > -	if (protected)
> > > +	/*
> > > +	 * Avoid leaving stale tlb entries in this vcpu representing
> > > +	 * sptes rmap_removed by invlpg without vcpu smp tlb flush.
> > > +	 */
> > > +	if (protected || vcpu->kvm->remote_tlbs_dirty)
> > > 		kvm_flush_remote_tlbs(vcpu->kvm);
> >
> > +void kvm_flush_remote_tlbs_cond(struct kvm *kvm, bool cond)
> > +{
> > +	if (cond || kvm->remote_tlbs_dirty)
> > +		kvm_flush_remote_tlbs(kvm);
> > +}
> >
> > Aren't they the same?
>
> I didn't particularly like kvm_flush_remote_tlbs_cond, and because I
> ended up editing the patch by hand in my code to fix it as I
> understood it, I ended up refactoring some bits like this. But if
> you like, I can add it back; I don't really care.

It was nice to hide explicit knowledge about
vcpu->kvm->remote_tlbs_dirty behind the interface instead of exposing
it.

> > > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> > > index 09782a9..060b4a3 100644
> > > --- a/arch/x86/kvm/paging_tmpl.h
> > > +++ b/arch/x86/kvm/paging_tmpl.h
> > > @@ -483,7 +483,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
> > > 	}
> > >
> > > 	if (need_flush)
> > > -		kvm_flush_remote_tlbs(vcpu->kvm);
> > > +		kvm_flush_local_tlb(vcpu);
> > > 	spin_unlock(&vcpu->kvm->mmu_lock);
> >
> > No need, caller does it for us.
>
> Right, but the caller does it differently, as commented above.
> Clearly one of the two has to go ;).

> > OOO ? Out Of Options?
>
> Eheeh, I meant out of order.

> > If remote_tlbs_dirty is protected by mmu_lock, most of this
> > complexity is unnecessary?
>
> Answered above.
>
> Thanks a lot for the review!! My current thought is that we should
> just move the unnecessary ->tlb_flush from the common invlpg handler
> to the other lowlevel ->invlpg handler, and then I hope this
> optimization will be measurable ;).

Depends on how often the guest uses invlpg, right. I believe it's a
worthwhile optimization.

TIA

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html