Revisiting after hiatus.

On 05/21/2012 11:58 PM, Marcelo Tosatti wrote:
> On Thu, May 17, 2012 at 01:24:42PM +0300, Avi Kivity wrote:
>> Signed-off-by: Avi Kivity <avi@xxxxxxxxxx>
>> ---
>>  virt/kvm/kvm_main.c | 16 ++++++++--------
>>  1 file changed, 8 insertions(+), 8 deletions(-)
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 585ab45..9f6d15d 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -302,11 +302,11 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
>>  	kvm->mmu_notifier_seq++;
>>  	if (kvm_unmap_hva(kvm, address))
>>  		kvm_mark_tlb_dirty(kvm);
>> -	/* we've to flush the tlb before the pages can be freed */
>> -	kvm_cond_flush_remote_tlbs(kvm);
>> -
>>  	spin_unlock(&kvm->mmu_lock);
>>  	srcu_read_unlock(&kvm->srcu, idx);
>> +
>> +	/* we've to flush the tlb before the pages can be freed */
>> +	kvm_cond_flush_remote_tlbs(kvm);
>>  }
>
> There are still sites that assumed implicitly that acquiring mmu_lock
> guarantees that sptes and remote TLBs are in sync. Example:
>
> void kvm_mmu_zap_all(struct kvm *kvm)
> {
>         struct kvm_mmu_page *sp, *node;
>         LIST_HEAD(invalid_list);
>
>         spin_lock(&kvm->mmu_lock);
> restart:
>         list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages,
>                                  link)
>                 if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
>                         goto restart;
>
>         kvm_mmu_commit_zap_page(kvm, &invalid_list);
>         spin_unlock(&kvm->mmu_lock);
> }
>
> kvm_mmu_commit_zap_page only flushes if the TLB was dirtied by this
> context, not before. kvm_mmu_unprotect_page is similar.
>
> In general:
>
> context 1                            context 2
>
> lock(mmu_lock)
> modify spte
> mark_tlb_dirty
> unlock(mmu_lock)
>                                      lock(mmu_lock)
>                                      read spte
>                                      make a decision based on spte value
>                                      unlock(mmu_lock)
> flush remote TLBs
>
> Is scary.

It scares me too.  It could be something trivial, like following an
intermediate paging structure entry or not, depending on whether it is
present.  I don't think we'll be able to audit the entire code base to
ensure nothing like that happens, or to enforce it later on.

> Perhaps have a rule that says:
>
> 1) Conditionally flush remote TLB after acquiring mmu_lock,
> before anything (even perhaps inside the lock macro).

I would like to avoid any flush within the lock.  We may back down on
this goal, but let's try to find something that works and doesn't
depend on huge audits.

> 2) Except special cases where it is clear that this is not
> necessary.

One option: your idea, but without taking the lock (in pseudocode):

  def mmu_begin():
      clean = False
      while not clean:
          cond_flush the tlb
          rtlb = kvm.remote_tlb_counter   # read atomically wrt the flush
          spin_lock(mmu_lock)
          clean = True
          if rtlb != kvm.remote_tlb_counter:
              clean = False
              spin_unlock(mmu_lock)

Since we're spinning over the tlb counter, there's no real advantage
here except that preemption is enabled.

A simpler option is to make mmu_end() do a cond_flush():

  def mmu_begin():
      spin_lock(mmu_lock)

  def mmu_end():
      spin_unlock(mmu_lock)
      cond_flush()

We need something for lockbreaking too:

  def mmu_lockbreak():
      if not (contended or need_resched):
          return False
      remember the flush counter
      cond_resched_lock(mmu_lock)
      return whether the flush counter changed

The caller would check the return value to see if it needs to redo
anything.  But this has the danger of long operations (like write
protecting a slot) never completing.
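
To make the simpler option more concrete, here is a rough C sketch of
what mmu_begin()/mmu_end()/mmu_lockbreak() could look like.  This is not
against any real tree: kvm_cond_flush_remote_tlbs() and
kvm_mark_tlb_dirty() are the primitives from the patch above, while
kvm->remote_tlb_counter is a hypothetical field bumped on every remote
flush.

  #include <linux/kvm_host.h>
  #include <linux/sched.h>
  #include <linux/spinlock.h>

  static void mmu_begin(struct kvm *kvm)
  {
          spin_lock(&kvm->mmu_lock);
  }

  static void mmu_end(struct kvm *kvm)
  {
          spin_unlock(&kvm->mmu_lock);
          /* flush outside the lock, before any pages can be freed */
          kvm_cond_flush_remote_tlbs(kvm);
  }

  /*
   * Drop mmu_lock if it is contended or we need to reschedule.
   * Returns true if a remote TLB flush happened while the lock was
   * dropped, i.e. any spte state the caller cached may be stale.
   */
  static bool mmu_lockbreak(struct kvm *kvm)
  {
          unsigned long counter;

          if (!spin_needbreak(&kvm->mmu_lock) && !need_resched())
                  return false;

          counter = kvm->remote_tlb_counter;      /* hypothetical field */
          cond_resched_lock(&kvm->mmu_lock);
          return counter != kvm->remote_tlb_counter;
  }

The design choice matches the patch: marking the TLB dirty happens under
mmu_lock, and the conditional flush is pushed to mmu_end(), after the
lock is dropped.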
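
And a purely illustrative caller showing how the lockbreak return value
might be used; write_protect_slot() and rmap_write_protect_gfn() are
made-up names for this sketch, not existing KVM functions.

  /* Illustrative only: restart the walk whenever a lockbreak saw a flush. */
  static void write_protect_slot(struct kvm *kvm,
                                 struct kvm_memory_slot *slot)
  {
          gfn_t gfn, last = slot->base_gfn + slot->npages;

          mmu_begin(kvm);
  restart:
          for (gfn = slot->base_gfn; gfn < last; gfn++) {
                  if (mmu_lockbreak(kvm))
                          goto restart;   /* cached state may be stale */
                  if (rmap_write_protect_gfn(kvm, gfn))   /* hypothetical */
                          kvm_mark_tlb_dirty(kvm);
          }
          mmu_end(kvm);
  }

Restarting from the beginning of the slot on every observed flush is
exactly what makes the never-completing danger above real; a real caller
would want to remember where it left off rather than redo the whole
walk.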