On Tue, Feb 14, 2012 at 07:53:56PM +0100, Andrea Arcangeli wrote:
> On Tue, Feb 14, 2012 at 03:29:47PM -0200, Marcelo Tosatti wrote:
> > The problem the patch is fixing is not related to page freeing, but
> > rmap_write_protect. From 8bf3f6f06fcdfd097b6c6ec51531d8292fa0d81d
>
> Can't find the commit on kvm.git.

Sorry, we got kvm.git out of sync. But you can see an equivalent below.

> > (replace "A (get_dirty_log)" with "mmu_notifier_invalidate_page"):
> >
> > During protecting pages for dirty logging, other threads may also try
> > to protect a page in mmu_sync_children() or kvm_mmu_get_page().
> >
> > In such a case, because get_dirty_log releases mmu_lock before flushing
> > TLB's, the following race condition can happen:
> >
> >   A (get_dirty_log)     B (another thread)
> >
> >   lock(mmu_lock)
> >   clear pte.w
> >   unlock(mmu_lock)
> >                         lock(mmu_lock)
> >                         pte.w is already cleared
> >                         unlock(mmu_lock)
> >                         skip TLB flush
>
> Not sure which tree it is, but in kvm and upstream I see an
> unconditional tlb flush here, not skip (both
> kvm_mmu_slot_remove_write_access and kvm_mmu_rmap_write_protect). So I
> assume this has been updated in your tree to be conditional.

	if (!direct) {
		if (rmap_write_protect(vcpu->kvm, gfn))
			kvm_flush_remote_tlbs(vcpu->kvm);

> Also note kvm_mmu_rmap_write_protect flushes outside of the mmu_lock
> in the kvm_mmu_rmap_write_protect case (like in the quoted description),
> so two write_protect_slot calls in parallel against each other may not
> be ok, but that may be enforced by design if qemu won't ever call that
> ioctl from two different userland threads (it doesn't sound security
> related, so it should be ok to enforce its safety by userland design).

Yes, here is the fix:

http://git.kernel.org/?p=virt/kvm/kvm.git;a=commit;h=02b48d00d7f1853bdf8a06da19ca5413ebe334c6

This is an equivalent of 8bf3f6f06fcdfd097b6c6ec51531d8292fa0d81d.

> >                         return
> >   ...
> >   TLB flush
> >
> > Though thread B assumes the page has already been protected when it
> > returns, the remaining TLB entry will break that assumption.
>
> Now I get to the question of why not run the TLB flush inside the
> mmu_lock only if the spte was writable :).
>
> kvm_mmu_get_page, as long as it only runs in the context of a kvm page
> fault, is ok, because the page fault would be inhibited by the mmu
> notifier invalidates, so maybe it's safe.

Ah, perhaps, but this was not taken into account before. Can you confirm
this is the case so we can revert the invalidate_page patch?
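For clarity, here is a rough sketch of the two orderings being discussed.
This is not code from any tree: the function names dirty_log_protect_racy
and dirty_log_protect_safe are made up for illustration, and only
rmap_write_protect() and kvm_flush_remote_tlbs() refer to the interfaces
quoted above (the rmap_write_protect prototype is approximated here).

#include <linux/kvm_host.h>	/* struct kvm, gfn_t, kvm_flush_remote_tlbs */

/* mmu.c-internal helper quoted above; prototype approximated */
bool rmap_write_protect(struct kvm *kvm, gfn_t gfn);

/*
 * Racy ordering (as in the quoted race): the conditional flush happens
 * after mmu_lock is dropped, so another thread can see pte.w already
 * clear, skip its own flush and return while a stale writable TLB
 * entry still exists.
 */
static void dirty_log_protect_racy(struct kvm *kvm, gfn_t gfn)
{
	bool flush;

	spin_lock(&kvm->mmu_lock);
	flush = rmap_write_protect(kvm, gfn);	/* clears pte.w */
	spin_unlock(&kvm->mmu_lock);
	if (flush)				/* too late by now */
		kvm_flush_remote_tlbs(kvm);
}

/*
 * Safe ordering: the flush is done before mmu_lock is dropped, so by
 * the time anyone else can observe the read-only spte under mmu_lock,
 * the stale writable TLB entry is already gone.
 */
static void dirty_log_protect_safe(struct kvm *kvm, gfn_t gfn)
{
	spin_lock(&kvm->mmu_lock);
	if (rmap_write_protect(kvm, gfn))
		kvm_flush_remote_tlbs(kvm);
	spin_unlock(&kvm->mmu_lock);
}

The kvm_mmu_get_page snippet quoted above follows the second pattern,
since it runs with mmu_lock already held.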
> mmu_sync_children seems to have a problem instead; in your tree
> get_dirty_log also has an issue if it has been updated to skip the
> flush on readonly sptes, I guess.
>
> Interesting how the spte is already non present, the page is just
> being freed shortly later, but yet we still need to trigger write
> faults synchronously and prevent other CPUs in guest mode from further
> modifying the page, to avoid losing dirty bit updates or updates on
> pagetables that map pagetables in the non-NPT/EPT case. If the page
> was really only going to be freed it would be ok if the other cpus
> would still write to it for a little longer until the page was freed.
>
> Like I wrote in the previous email, I was thinking that if we'd change
> the mmu notifier methods to do an unconditional flush, then every other
> flush could also run outside of the mmu_lock. But then I didn't think
> enough about this to be sure. My guess is we could move all flushes
> outside the mmu_lock if we stop flushing the tlb conditionally on the
> current spte values. It'd clearly be slower for an UP guest though :).
> Large SMP guests might benefit, if that is feasible at all... It
> depends how problematic the mmu_lock is on the large SMP guests and
> how much we're saving by doing conditional TLB flushes.

Also, it should not be necessary for these flushes to be inside mmu_lock
in the EPT/NPT case (since there is no write protection there). But it
would be awkward to differentiate the unlock position based on EPT/NPT.
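Just to illustrate why differentiating the unlock position would be
awkward, a sketch of roughly what it would have to look like. This is
illustrative only: write_protect_work() is a made-up placeholder for the
actual write protection work, and tdp_enabled is used here only as a
stand-in for "EPT/NPT in use".

#include <linux/kvm_host.h>

extern bool tdp_enabled;			/* stand-in for "EPT/NPT in use" */
bool write_protect_work(struct kvm *kvm);	/* made-up placeholder */

static void write_protect_and_flush(struct kvm *kvm)
{
	bool flush;

	spin_lock(&kvm->mmu_lock);
	flush = write_protect_work(kvm);

	if (tdp_enabled)
		/* EPT/NPT: no write protection to race with, drop the lock early */
		spin_unlock(&kvm->mmu_lock);

	if (flush)
		kvm_flush_remote_tlbs(kvm);

	if (!tdp_enabled)
		/* shadow paging: the flush had to complete under the lock */
		spin_unlock(&kvm->mmu_lock);
}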