On Sat, Apr 18, 2009 at 05:34:27PM +0200, Andrea Arcangeli wrote:
> On Sun, Apr 12, 2009 at 07:31:58PM -0300, Marcelo Tosatti wrote:
> > mmu_sync_children needs to find any _reachable_ sptes that are
> > unsync, read the guest pagetable, and sync the sptes. Before it can
> > inspect the guest pagetable, it needs to write protect it, with
> > rmap_write_protect.
>
> So far so good.
>
> > In theory it wouldn't be necessary to call
> > kvm_flush_remote_tlbs_cond(protected) here, but only
> > kvm_flush_remote_tlbs(), since the "kvm->remote_tlbs_dirty"
> > information is not pertinent to mmu_sync_children.
>
> Hmm, I'm not sure I fully understand how it is not pertinent.
>
> When we have a cr3 write in a remote vcpu, mmu_sync_children runs and
> resyncs all sptes reachable from that remote vcpu context. But to
> resync the sptes, it also has to get rid of any old writable tlb
> entry for the remote vcpu where the cr3 write is running. Checking
> only the sptes to find writable ones, updating those sptes, and
> marking them wrprotected, isn't enough, as old spte contents may
> still be cached in the tlb if remote_tlbs_dirty is true!
>
> Think of the case where the invlpg'd spte that got nuked by
> rmap_remove in the invlpg handler running on the current vcpu has a
> writable tlb entry active in the vcpu that later does the cr3
> overwrite. mmu_sync_children running on the remote vcpu will find no
> writable spte in the rmap chain representing that spte (because the
> spte still cached in the remote tlb has already been zapped by the
> current vcpu), but it is still cached and writable in the remote
> vcpu's TLB when the cr3 overwrite runs.

Right, so you have to cope with the fact that invlpg can skip a TLB
flush. OK.

> > But this is done here to "clear" remote_tlbs_dirty (after a
> > kvm_flush_remote_tlbs, remote_tlbs_dirty is clean), ie: as an
> > optimization.
>
> The whole point of remote_tlbs_dirty is to defer any smp tlb flush to
> the last possible moment, so how can it be an optimization to run a
> kvm_flush_remote_tlbs earlier than necessary? The only way this can
> be an optimization is to never run kvm_flush_remote_tlbs unless
> absolutely necessary, and to leave remote_tlbs_dirty true instead of
> calling kvm_flush_remote_tlbs. Calling kvm_flush_remote_tlbs_cond
> instead of kvm_flush_remote_tlbs cannot ever be an optimization.

Right.

> > > >> @@ -465,7 +464,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu,
> > > >> gva_t gva)
> > > >> 			rmap_remove(vcpu->kvm, sptep);
> > > >> 			if (is_large_pte(*sptep))
> > > >> 				--vcpu->kvm->stat.lpages;
> > > >> -			need_flush = 1;
> > > >> +			vcpu->kvm->remote_tlbs_dirty = true;
> > > >> 		}
> > > >> 		set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
> > > >> 		break;
> > > >> @@ -475,8 +474,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t
> > > >> gva)
> > > >> 		break;
> > > >> 	}
> > > >> -	if (need_flush)
> > > >> -		kvm_flush_remote_tlbs(vcpu->kvm);
> > > >> 	spin_unlock(&vcpu->kvm->mmu_lock);
> > >
> > > AFAIK, to be compliant with lowlevel archs (without ASN it
> > > doesn't matter I think, as vmx always flushes on exit), we have
> > > to flush the local tlb here, with set_bit(KVM_REQ_TLB_FLUSH,
> > > &vcpu->requests). I don't see why it's missing. Or am I wrong?
> >
> > Caller does it:
> >
> > void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
> > {
> > 	vcpu->arch.mmu.invlpg(vcpu, gva);
> > 	kvm_mmu_flush_tlb(vcpu);
> > 	++vcpu->stat.invlpg;
> > }
>
> Ah great! Avi also mentioned it, I recall, but I didn't figure out
> that it ran after FNAME(invlpg) returns. But isn't it always more
> efficient to set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests) instead,
> like I'm doing?

Sure, that works too.

> See my version of kvm_flush_local_tlb; it's a bit different from
> kvm_mmu_flush_tlb, and I made the old no-mmu-notifier kernel safe
> too, which I think is worth it.
> If you'd like to keep my version of kvm_flush_local_tlb, then I
> simply have to remove the kvm_mmu_flush_tlb from kvm_mmu_invlpg and
> move it inside the arch.mmu.invlpg handler, so each shadow
> implementation does it its own way. Comments?

I'm fine with your kvm_flush_local_tlb. Just one minor nit:

+	/* get new asid before returning to guest mode */
+	if (!test_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests))
+		set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests);

What's the test_bit for?

> If you disagree, and you want to run kvm_mmu_flush_tlb in
> kvm_mmu_invlpg instead of set_bit(KVM_REQ_TLB_FLUSH,
> &vcpu->requests), then I can expand the kvm_flush_local_tlb with the
> smp_wmb() in the FNAME(invlpg) code. Like:
>
> 	if (need_flush) {
> 		smp_wmb();
> 		remote_tlbs_dirty = true;
> 	}
> 	spin_unlock(mmu_lock);
>
> Then the local tlb flush will run when we return from FNAME(invlpg),
> and remote_tlbs_dirty is set _after_ set_shadow_pte and _before_
> releasing mmu_lock, making it still safe against
> mmu_notifier_invalidate_page/range.
>
> > What about protecting remote_tlbs_dirty with mmu_lock? The only
> > caller of kvm_flush_remote_tlbs that lacks mmu_notifier is
> > kvm_mmu_zap_all, which is not performance sensitive anyway.
>
> I thought of it too, I have to say. I tried it a bit too, then I
> figured out why Avi wanted to do it out of order. The reason is that
> we want to allow kvm_flush_remote_tlbs to run outside the mmu_lock
> too. So this actually results in more self-containment of the
> remote_tlbs_dirty logic and simpler code. But at least I documented
> what the smp_wmb is serializing and how it works.

OK.

> > > -	if (protected)
> > > +	/*
> > > +	 * Avoid leaving stale tlb entries in this vcpu representing
> > > +	 * sptes rmap_removed by invlpg without vcpu smp tlb flush.
> > > +	 */
> > > +	if (protected || vcpu->kvm->remote_tlbs_dirty)
> > > 		kvm_flush_remote_tlbs(vcpu->kvm);
> >
> > +void kvm_flush_remote_tlbs_cond(struct kvm *kvm, bool cond)
> > +{
> > +	if (cond || kvm->remote_tlbs_dirty)
> > +		kvm_flush_remote_tlbs(kvm);
> > +}
> >
> > Aren't they the same?
>
> I didn't particularly like kvm_flush_remote_tlbs_cond, and because I
> ended up editing the patch by hand in my code to fix it as I
> understood it, I ended up refactoring some bits like this. But if
> you like, I can add it back; I don't really care.

It was nice to hide explicit knowledge about
vcpu->kvm->remote_tlbs_dirty behind the interface instead of exposing
it.

> > > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> > > index 09782a9..060b4a3 100644
> > > --- a/arch/x86/kvm/paging_tmpl.h
> > > +++ b/arch/x86/kvm/paging_tmpl.h
> > > @@ -483,7 +483,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
> > > 	}
> > >
> > > 	if (need_flush)
> > > -		kvm_flush_remote_tlbs(vcpu->kvm);
> > > +		kvm_flush_local_tlb(vcpu);
> > > 	spin_unlock(&vcpu->kvm->mmu_lock);
> >
> > No need, caller does it for us.
>
> Right, but the caller does it differently, as commented above.
> Clearly one of the two has to go ;).

> > OOO ? Out Of Options?
>
> Eheeh, I meant out of order.

> > If remote_tlbs_dirty is protected by mmu_lock, most of this
> > complexity is unnecessary?
>
> Answered above.
>
> Thanks a lot for the review!! My current thought is that we should
> just move the unnecessary ->tlb_flush from the common invlpg handler
> to the other lowlevel ->invlpg handler, and then I hope this
> optimization will be measurable ;).

Depends on how often the guest uses invlpg, right. I believe it's a
worthwhile optimization.

TIA

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html