On Sun, Apr 12, 2009 at 07:31:58PM -0300, Marcelo Tosatti wrote:
> mmu_sync_children needs to find any _reachable_ sptes that are unsync,
> read the guest pagetable, and sync the sptes. Before it can inspect the
> guest pagetable, it needs to write protect it, with rmap_write_protect.

So far so good.

> In theory it wouldnt be necesarry to call
> kvm_flush_remote_tlbs_cond(protected) here, but only
> kvm_flush_remote_tlbs(), since the "kvm->remote_tlbs_dirty" information
> is not pertinent to mmu_sync_children.

Hmm, I'm not sure I fully understand how it is not pertinent. When there
is a cr3 write in a remote vcpu, mmu_sync_children runs and resyncs all
sptes reachable from that remote vcpu context. But to resync the sptes,
it also has to get rid of any old writable tlb entry in the remote vcpu
where the cr3 write is running. Checking only the sptes to find writable
ones, updating the sptes they map, and marking them write-protected
isn't enough, because old spte contents may still be cached in the tlb
if remote_tlbs_dirty is true!

Think of the invlpg'd spte that got nuked by rmap_remove in the invlpg
handler running on the current vcpu, while a writable tlb entry for it
is still active in the vcpu that later does the cr3 overwrite.
mmu_sync_children running in the remote vcpu will find no writable spte
in the rmap chain representing that spte (because the spte that is still
cached in the remote tlb has already been zapped by the current vcpu),
but it is still cached and writable in the remote vcpu's TLB when the
cr3 overwrite runs.

> But this is done here to "clear" remote_tlbs_dirty (after a
> kvm_flush_remote_tlbs remote_tlbs_dirty is clean), ie: as an
> optimization.

The whole point of remote_tlbs_dirty is to defer any smp tlb flush to
the latest possible time, so how can it be an optimization to run
kvm_flush_remote_tlbs earlier than necessary? The only way this can be
an optimization is to never run kvm_flush_remote_tlbs unless absolutely
necessary, and to leave remote_tlbs_dirty true instead of calling
kvm_flush_remote_tlbs. Calling kvm_flush_remote_tlbs_cond instead of
kvm_flush_remote_tlbs can never be an optimization.

> > >> @@ -465,7 +464,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
> > >>  			rmap_remove(vcpu->kvm, sptep);
> > >>  			if (is_large_pte(*sptep))
> > >>  				--vcpu->kvm->stat.lpages;
> > >> -			need_flush = 1;
> > >> +			vcpu->kvm->remote_tlbs_dirty = true;
> > >>  		}
> > >>  		set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
> > >>  		break;
> > >> @@ -475,8 +474,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
> > >>  		break;
> > >>  	}
> > >>
> > >> -	if (need_flush)
> > >> -		kvm_flush_remote_tlbs(vcpu->kvm);
> > >>  	spin_unlock(&vcpu->kvm->mmu_lock);
> >
> > AFIK to be compliant with lowlevel archs (without ASN it doesn't
> > matter I think as vmx always flush on exit), we have to flush the
> > local tlb here, with set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests). I
> > don't see why it's missing. Or am I wrong?
>
> Caller does it:
>
> void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
> {
> 	vcpu->arch.mmu.invlpg(vcpu, gva);
> 	kvm_mmu_flush_tlb(vcpu);
> 	++vcpu->stat.invlpg;
> }

Ah great! Avi also mentioned it, I recall, but I didn't figure out that
it happens after FNAME(invlpg) returns. But isn't it always more
efficient to set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests) instead, like
I'm doing? See my version of kvm_flush_local_tlb, which is a bit
different from kvm_mmu_flush_tlb, and I've also made it safe on the old
kernels without mmu notifiers, which I think is worth it.
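To make the idea concrete, here is a rough sketch of what I mean by
kvm_flush_local_tlb (a sketch only, not the literal hunk from my patch;
the CONFIG_MMU_NOTIFIER ifdef just stands for the split between kernels
with and without mmu notifiers):

	/*
	 * Sketch: defer the local flush to the next guest entry via
	 * KVM_REQ_TLB_FLUSH instead of flushing the hardware tlb right away.
	 */
	static void kvm_flush_local_tlb(struct kvm_vcpu *vcpu)
	{
		/* get a tlb flush before returning to guest mode */
		set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests);

	#ifdef CONFIG_MMU_NOTIFIER
		/*
		 * Make the spte update visible before anybody samples
		 * remote_tlbs_dirty and flushes on our behalf.
		 */
		smp_wmb();
		vcpu->kvm->remote_tlbs_dirty = true;
	#else
		/*
		 * Without mmu notifiers nobody would ever notice
		 * remote_tlbs_dirty, so don't defer the remote flush.
		 */
		kvm_flush_remote_tlbs(vcpu->kvm);
	#endif
	}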
If you'd like to keep my version of kvm_flush_local_tlb, then I simply
have to remove the kvm_mmu_flush_tlb call from kvm_mmu_invlpg and move
it inside the arch.mmu.invlpg handler, so each shadow implementation
does it its own way. Comments?

If you disagree, and you want to keep running kvm_mmu_flush_tlb in
kvm_mmu_invlpg instead of set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests),
then I can open-code kvm_flush_local_tlb, with the smp_wmb(), directly
in the FNAME(invlpg) code. Like:

	if (need_flush) {
		smp_wmb();
		remote_tlbs_dirty = true;
	}
	spin_unlock(mmu_lock);

Then the local tlb flush will run when we return from FNAME(invlpg),
and remote_tlbs_dirty is set _after_ set_shadow_pte and _before_
releasing mmu_lock, which keeps it safe against
mmu_notifier_invalidate_page/range.

> What about protecting remote_tlbs_dirty with mmu_lock? Only caller of
> kvm_flush_remote_tlbs that lacks mmu_notifier is kvm_mmu_zap_all, which
> is not performance sensitive anyway.

I have to say I thought of that too. I tried it a bit as well, then I
figured out why Avi wanted to do it out of order: the reason is that we
want to allow kvm_flush_remote_tlbs to run outside mmu_lock too. So this
actually results in more self-containment of the remote_tlbs_dirty logic
and simpler code. But I at least documented what the smp_wmb is
serializing and how it works.

> > -	if (protected)
> > +	/*
> > +	 * Avoid leaving stale tlb entries in this vcpu representing
> > +	 * sptes rmap_removed by invlpg without vcpu smp tlb flush.
> > +	 */
> > +	if (protected || vcpu->kvm->remote_tlbs_dirty)
> >  		kvm_flush_remote_tlbs(vcpu->kvm);
>
> +void kvm_flush_remote_tlbs_cond(struct kvm *kvm, bool cond)
> +{
> +	if (cond || kvm->remote_tlbs_dirty)
> +		kvm_flush_remote_tlbs(kvm);
> +}
>
> Aren't they the same?

I didn't particularly like kvm_flush_remote_tlbs_cond, and because I
ended up editing the patch by hand in my code to fix it as I understood
it, I also ended up refactoring some bits like this. But if you prefer
I can add it back, I don't really care.

> > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> > index 09782a9..060b4a3 100644
> > --- a/arch/x86/kvm/paging_tmpl.h
> > +++ b/arch/x86/kvm/paging_tmpl.h
> > @@ -483,7 +483,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
> >  	}
> >
> >  	if (need_flush)
> > -		kvm_flush_remote_tlbs(vcpu->kvm);
> > +		kvm_flush_local_tlb(vcpu);
> >  	spin_unlock(&vcpu->kvm->mmu_lock);
>
> No need, caller does it for us.

Right, but the caller does it differently, as commented above. Clearly
one of the two has to go ;).

> OOO ?

Out Of Options? Eheeh, I meant out of order.

> If remote_tlbs_dirty is protected by mmu_lock, most of this complexity
> is unecessary?

Answered above.

Thanks a lot for the review!! My current thought is that we should just
move the now-unnecessary ->tlb_flush from the common invlpg handler into
the other lowlevel .invlpg handler, and then I hope this optimization
will be measurable ;).

Andrea
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html