On Tue, Feb 14, 2012 at 03:29:47PM -0200, Marcelo Tosatti wrote:
> The problem the patch is fixing is not related to page freeing, but
> rmap_write_protect. From 8bf3f6f06fcdfd097b6c6ec51531d8292fa0d81d

Can't find the commit on kvm.git.

> (replace "A (get_dirty_log)" with "mmu_notifier_invalidate_page"):
>
>     During protecting pages for dirty logging, other threads may also try
>     to protect a page in mmu_sync_children() or kvm_mmu_get_page().
>
>     In such a case, because get_dirty_log releases mmu_lock before flushing
>     TLB's, the following race condition can happen:
>
>       A (get_dirty_log)     B (another thread)
>
>       lock(mmu_lock)
>       clear pte.w
>       unlock(mmu_lock)
>                             lock(mmu_lock)
>                             pte.w is already cleared
>                             unlock(mmu_lock)
>                             skip TLB flush

Not sure which tree it is, but in kvm and upstream I see an
unconditional tlb flush here, not skip (in both
kvm_mmu_slot_remove_write_access and kvm_mmu_rmap_write_protect), so
I assume this has been updated in your tree to be conditional.

Also note that kvm_mmu_rmap_write_protect flushes outside of the
mmu_lock (like in the quoted description), so two write_protect_slot
runs in parallel against each other may not be ok. But that may be
enforced by design, if qemu won't ever call that ioctl from two
different userland threads (it doesn't sound security related, so it
should be ok to enforce its safety by userland design).

>                             return
>       ...
>       TLB flush
>
>     Though thread B assumes the page has already been protected when it
>     returns, the remaining TLB entry will break that assumption.
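Just so we're looking at the same code, the racy pattern I assume
your tree has now is something like the sketch below (not actual
kvm.git code: upstream kvm_mmu_slot_remove_write_access returns void
and the flush is unconditional, the int return value is the assumed
modification):

static void write_protect_slot(struct kvm *kvm, int slot)
{
        int flush;

        spin_lock(&kvm->mmu_lock);
        /*
         * Assumed change: returns nonzero only if some spte
         * actually went from writable to readonly.
         */
        flush = kvm_mmu_slot_remove_write_access(kvm, slot);
        spin_unlock(&kvm->mmu_lock);

        /*
         * Window for thread B: it can take mmu_lock here, find
         * pte.w already clear, skip its own flush and return
         * while a stale writable TLB entry still exists.
         */
        if (flush)
                kvm_flush_remote_tlbs(kvm);
}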
Now I get to the question of why not run the TLB flush inside the
mmu_lock, only if the spte was writable :).
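I.e. keeping the conditional flush but doing it before dropping the
lock, like this (same sketch and same assumed return value as above):

        spin_lock(&kvm->mmu_lock);
        flush = kvm_mmu_slot_remove_write_access(kvm, slot);
        /*
         * Flushing before the unlock closes the window: whoever
         * takes mmu_lock afterwards and finds pte.w clear can
         * also rely on the stale TLB entry being gone.
         */
        if (flush)
                kvm_flush_remote_tlbs(kvm);
        spin_unlock(&kvm->mmu_lock);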
kvm_mmu_get_page, as long as it only runs in the context of a kvm
page fault, is ok, because the page fault would be inhibited by the
mmu notifier invalidates, so maybe it's safe.

mmu_sync_children seems to have a problem instead, and in your tree
get_dirty_log also has an issue, if it has been updated to skip the
flush on readonly sptes like I guessed above.

Interesting how the spte is already non present and the page is just
going to be freed shortly later, but we still need to trigger write
faults synchronously and prevent other CPUs in guest mode from
further modifying the page, to avoid losing dirty bit updates or
updates on pagetables that map pagetables in the non NPT/EPT case.
If the page was really only going to be freed, it would be ok if the
other cpus kept writing to it for a little longer, until the page
was actually freed.

Like I wrote in the previous email, I was thinking that if we
changed the mmu notifier methods to do an unconditional flush, then
every other flush could also run outside of the mmu_lock. But I
haven't thought about this enough to be sure.
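Roughly like the sketch below (only the idea, with the srcu locking
and part of the mmu notifier bookkeeping of the real
kvm_mmu_notifier_invalidate_page dropped for brevity):

static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
                                             struct mm_struct *mm,
                                             unsigned long address)
{
        struct kvm *kvm = mmu_notifier_to_kvm(mn);

        spin_lock(&kvm->mmu_lock);
        kvm->mmu_notifier_seq++;
        kvm_unmap_hva(kvm, address);
        /*
         * Flush unconditionally, even if kvm_unmap_hva found no
         * present spte: then a flusher that dropped mmu_lock
         * before flushing can never leave a stale writable TLB
         * entry behind across a page free.
         */
        kvm_flush_remote_tlbs(kvm);
        spin_unlock(&kvm->mmu_lock);
}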
My guess is we could move all flushes outside the mmu_lock if we
stop flushing the tlb conditionally on the current spte values. It'd
clearly be slower for an UP guest though :). Large SMP guests might
benefit, if that is feasible at all... It depends how problematic
the mmu_lock is on the large SMP guests and how much we're saving by
doing conditional TLB flushes.