Re: [PATCH] KVM: Defer remote tlb flushes on invlpg (v4)

On Sun, Apr 12, 2009 at 07:31:58PM -0300, Marcelo Tosatti wrote:
> mmu_sync_children needs to find any _reachable_ sptes that are unsync,
> read the guest pagetable, and sync the sptes. Before it can inspect the
> guest pagetable, it needs to write protect it, with rmap_write_protect.

So far so good.

> In theory it wouldn't be necessary to call
> kvm_flush_remote_tlbs_cond(protected) here, but only
> kvm_flush_remote_tlbs(), since the "kvm->remote_tlbs_dirty" information
> is not pertinent to mmu_sync_children.

Hmm, I'm not sure I fully understand how it is not pertinent.

When a remote vcpu writes cr3, mmu_sync_children runs and resyncs all
sptes reachable from that remote vcpu's context. But to resync the
sptes, it also has to get rid of any old writable tlb entry in the
remote vcpu where the cr3 write is running. Checking only the sptes to
find writable ones, updating the sptes they map, and marking them
wrprotected isn't enough: old spte contents may still be cached in the
tlb if remote_tlbs_dirty is true!

Think of an invlpg'd spte that got nuked by rmap_remove in the invlpg
handler running on the current vcpu, while a writable tlb entry for it
is still active in the vcpu that later overwrites cr3.
mmu_sync_children running on that remote vcpu will find no writable
spte in the rmap chain representing it (that spte has already been
zapped by the current vcpu), yet the translation is still cached and
writable in the remote vcpu's TLB when the cr3 overwrite runs.

> But this is done here to "clear" remote_tlbs_dirty (after a
> kvm_flush_remote_tlbs remote_tlbs_dirty is clean), ie: as an
> optimization.

The whole point of remote_tlbs_dirty is to defer any smp tlb flush to
the latest possible time, so how can it be an optimization to run
kvm_flush_remote_tlbs earlier than necessary? The only way this can be
an optimization is to never run kvm_flush_remote_tlbs unless
absolutely necessary, leaving remote_tlbs_dirty true instead of
calling kvm_flush_remote_tlbs. Calling kvm_flush_remote_tlbs_cond
instead of kvm_flush_remote_tlbs can never be an optimization.

> > >>  @@ -465,7 +464,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
> > >>  				rmap_remove(vcpu->kvm, sptep);
> > >>  				if (is_large_pte(*sptep))
> > >>  					--vcpu->kvm->stat.lpages;
> > >> -				need_flush = 1;
> > >> +				vcpu->kvm->remote_tlbs_dirty = true;
> > >>  			}
> > >>  			set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
> > >>  			break;
> > >> @@ -475,8 +474,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
> > >>  			break;
> > >>  	}
> > >>  -	if (need_flush)
> > >> -		kvm_flush_remote_tlbs(vcpu->kvm);
> > >>  	spin_unlock(&vcpu->kvm->mmu_lock);
> > 
> > AFAIK, to be compliant with lowlevel archs (without ASN it doesn't
> > matter, I think, as vmx always flushes on exit), we have to flush the
> > local tlb here, with set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests). I
> > don't see why it's missing. Or am I wrong?
> 
> Caller does it:
> 
> void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
> {
>         vcpu->arch.mmu.invlpg(vcpu, gva);
>         kvm_mmu_flush_tlb(vcpu);
>         ++vcpu->stat.invlpg;
> }

Ah great! Avi also mentioned it, I recall, but I hadn't figured out
that it happens after FNAME(invlpg) returns. But isn't it always more
efficient to set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests) instead, like
I'm doing?

See my version of kvm_flush_local_tlb: it's a bit different from
kvm_mmu_flush_tlb, and I made it safe for the old kernels without mmu
notifiers too, which I think is worth it. If you'd like to keep my
version of kvm_flush_local_tlb, then I simply have to remove the
kvm_mmu_flush_tlb from kvm_mmu_invlpg and move it inside the
arch.mmu.invlpg handler, so each shadow implementation does it its own
way. Comments?

If you disagree, and you want to run kvm_mmu_flush_tlb in
kvm_mmu_invlpg instead of set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests),
then I can expand kvm_flush_local_tlb, with its smp_wmb(), directly in
the FNAME(invlpg) code. Like:

	if (need_flush) {
		smp_wmb();
		vcpu->kvm->remote_tlbs_dirty = true;
	}
	spin_unlock(&vcpu->kvm->mmu_lock);

Then the local tlb flush will run when we return from FNAME(invlpg),
and remote_tlbs_dirty is set _after_ set_shadow_pte and _before_
releasing mmu_lock, keeping it safe against
mmu_notifier_invalidate_page/range.

> What about protecting remote_tlbs_dirty with mmu_lock? Only caller of
> kvm_flush_remote_tlbs that lacks mmu_notifier is kvm_mmu_zap_all, which
> is not performance sensitive anyway.

I thought of that too, I have to say, and tried it a bit; then I
figured out why Avi wanted to do it out of order. The reason is that
we want to allow kvm_flush_remote_tlbs to run outside the mmu_lock
too. This actually results in more self-containment of the
remote_tlbs_dirty logic and simpler code. But I documented at least
what the smp_wmb is serializing and how it works.

> > -		if (protected)
> > +		/*
> > +		 * Avoid leaving stale tlb entries in this vcpu representing
> > +		 * sptes rmap_removed by invlpg without vcpu smp tlb flush.
> > +		 */
> > +		if (protected || vcpu->kvm->remote_tlbs_dirty)
> >  			kvm_flush_remote_tlbs(vcpu->kvm);
> 
> +void kvm_flush_remote_tlbs_cond(struct kvm *kvm, bool cond)
> +{
> +   if (cond || kvm->remote_tlbs_dirty)
> +       kvm_flush_remote_tlbs(kvm);
> +}
> 
> Aren't they the same?

I didn't particularly like kvm_flush_remote_tlbs_cond, and because I
ended up editing the patch by hand in my tree to fix it as I
understood it, I ended up refactoring some bits like this. But if you
like I can add it back; I don't really care.

> > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> > index 09782a9..060b4a3 100644
> > --- a/arch/x86/kvm/paging_tmpl.h
> > +++ b/arch/x86/kvm/paging_tmpl.h
> > @@ -483,7 +483,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
> >  	}
> >  
> >  	if (need_flush)
> > -		kvm_flush_remote_tlbs(vcpu->kvm);
> > +		kvm_flush_local_tlb(vcpu);
> >  	spin_unlock(&vcpu->kvm->mmu_lock);
> 
> No need, caller does it for us.

Right, but the caller does it differently, as commented above. Clearly
one of the two has to go ;).

> OOO ? Out Of Options?

Eheeh, I meant out of order.

> If remote_tlbs_dirty is protected by mmu_lock, most of this complexity
> is unnecessary?

Answered above.

Thanks a lot for the review!! My current thought is that we should
just move the unnecessary ->tlb_flush from the common invlpg handler
into the other lowlevel .invlpg handler, and then I hope this
optimization will be measurable ;).

Andrea