On Tue, Feb 14, 2012 at 07:53:56PM +0100, Andrea Arcangeli wrote:
> On Tue, Feb 14, 2012 at 03:29:47PM -0200, Marcelo Tosatti wrote:
> > The problem the patch is fixing is not related to page freeing, but
> > rmap_write_protect. From 8bf3f6f06fcdfd097b6c6ec51531d8292fa0d81d
>
> Can't find the commit on kvm.git.

Sorry, we got kvm.git out of sync. But you can see an equivalent below.

> > (replace "A (get_dirty_log)" with "mmu_notifier_invalidate_page"):
> >
> > During protecting pages for dirty logging, other threads may also try
> > to protect a page in mmu_sync_children() or kvm_mmu_get_page().
> >
> > In such a case, because get_dirty_log releases mmu_lock before flushing
> > TLB's, the following race condition can happen:
> >
> >   A (get_dirty_log)     B (another thread)
> >
> >   lock(mmu_lock)
> >   clear pte.w
> >   unlock(mmu_lock)
> >                         lock(mmu_lock)
> >                         pte.w is already cleared
> >                         unlock(mmu_lock)
> >                         skip TLB flush
>
> Not sure which tree it is, but in kvm and upstream I see an
> unconditional tlb flush here, not skip (both
> kvm_mmu_slot_remove_write_access and kvm_mmu_rmap_write_protect). So I
> assume this has been updated in your tree to be conditional.

	if (!direct) {
		if (rmap_write_protect(vcpu->kvm, gfn))
			kvm_flush_remote_tlbs(vcpu->kvm);

> Also note kvm_mmu_rmap_write_protect flushes outside of the mmu_lock
> in the kvm_mmu_rmap_write_protect case (like in the quoted description),
> so two write_protect_slot calls in parallel against each other may not
> be ok, but that may be enforced by design if qemu won't ever call that
> ioctl from two different userland threads (it doesn't sound security
> related, so it should be ok to enforce its safety by userland design).

Yes, here is the fix:

http://git.kernel.org/?p=virt/kvm/kvm.git;a=commit;h=02b48d00d7f1853bdf8a06da19ca5413ebe334c6

This is an equivalent of 8bf3f6f06fcdfd097b6c6ec51531d8292fa0d81d.

> >                         return
> >   ...
> >   TLB flush
> >
> > Though thread B assumes the page has already been protected when it
> > returns, the remaining TLB entry will break that assumption.
>
> Now I get to the question of why not run the TLB flush inside the
> mmu_lock only if the spte was writable :).
>
> kvm_mmu_get_page, as long as it only runs in the context of a kvm page
> fault, is ok, because the page fault would be inhibited by the mmu
> notifier invalidates, so maybe it's safe.

Ah, perhaps, but this was not taken into account before. Can you confirm
this is the case so we can revert the invalidate_page patch?
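For clarity, here is a rough sketch of the two orderings being discussed.
This is not code from any tree: the function names dirty_log_protect_racy
and dirty_log_protect_safe are made up for illustration, and only
rmap_write_protect() and kvm_flush_remote_tlbs() refer to the interfaces
quoted above (the rmap_write_protect prototype is approximated here).

#include <linux/kvm_host.h>	/* struct kvm, gfn_t, kvm_flush_remote_tlbs */

/* mmu.c-internal helper quoted above; prototype approximated */
bool rmap_write_protect(struct kvm *kvm, gfn_t gfn);

/*
 * Racy ordering (as in the quoted race): the conditional flush happens
 * after mmu_lock is dropped, so another thread can see pte.w already
 * clear, skip its own flush and return while a stale writable TLB
 * entry still exists.
 */
static void dirty_log_protect_racy(struct kvm *kvm, gfn_t gfn)
{
	bool flush;

	spin_lock(&kvm->mmu_lock);
	flush = rmap_write_protect(kvm, gfn);	/* clears pte.w */
	spin_unlock(&kvm->mmu_lock);
	if (flush)				/* too late by now */
		kvm_flush_remote_tlbs(kvm);
}

/*
 * Safe ordering: the flush is done before mmu_lock is dropped, so by
 * the time anyone else can observe the read-only spte under mmu_lock,
 * the stale writable TLB entry is already gone.
 */
static void dirty_log_protect_safe(struct kvm *kvm, gfn_t gfn)
{
	spin_lock(&kvm->mmu_lock);
	if (rmap_write_protect(kvm, gfn))
		kvm_flush_remote_tlbs(kvm);
	spin_unlock(&kvm->mmu_lock);
}

The kvm_mmu_get_page snippet quoted above follows the second pattern,
since it runs with mmu_lock already held.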
> mmu_sync_children seems to have a problem instead; in your tree
> get_dirty_log also has an issue if it has been updated to skip the
> flush on readonly sptes, I guess.
>
> Interesting how the spte is already non present, the page is just
> being freed shortly later, but yet we still need to trigger write
> faults synchronously and prevent other CPUs in guest mode from further
> modifying the page, to avoid losing dirty bit updates or updates on
> pagetables that map pagetables in the non-NPT/EPT case. If the page
> was really only going to be freed it would be ok if the other cpus
> would still write to it for a little longer until the page was freed.
>
> Like I wrote in the previous email, I was thinking that if we'd change
> the mmu notifier methods to do an unconditional flush, then every other
> flush could also run outside of the mmu_lock. But then I didn't think
> enough about this to be sure. My guess is we could move all flushes
> outside the mmu_lock if we stop flushing the tlb conditionally on the
> current spte values. It'd clearly be slower for an UP guest though :).
> Large SMP guests might benefit, if that is feasible at all... It
> depends how problematic the mmu_lock is on the large SMP guests and
> how much we're saving by doing conditional TLB flushes.

Also, it should not be necessary for these flushes to be inside mmu_lock
in the EPT/NPT case (since there is no write protection there). But it
would be awkward to differentiate the unlock position based on EPT/NPT.
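Just to illustrate why differentiating the unlock position would be
awkward, a sketch of roughly what it would have to look like. This is
illustrative only: write_protect_work() is a made-up placeholder for the
actual write protection work, and tdp_enabled is used here only as a
stand-in for "EPT/NPT in use".

#include <linux/kvm_host.h>

extern bool tdp_enabled;			/* stand-in for "EPT/NPT in use" */
bool write_protect_work(struct kvm *kvm);	/* made-up placeholder */

static void write_protect_and_flush(struct kvm *kvm)
{
	bool flush;

	spin_lock(&kvm->mmu_lock);
	flush = write_protect_work(kvm);

	if (tdp_enabled)
		/* EPT/NPT: no write protection to race with, drop the lock early */
		spin_unlock(&kvm->mmu_lock);

	if (flush)
		kvm_flush_remote_tlbs(kvm);

	if (!tdp_enabled)
		/* shadow paging: the flush had to complete under the lock */
		spin_unlock(&kvm->mmu_lock);
}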