On Tue, Feb 14, 2012 at 03:29:47PM -0200, Marcelo Tosatti wrote:
> The problem the patch is fixing is not related to page freeing, but
> rmap_write_protect. From 8bf3f6f06fcdfd097b6c6ec51531d8292fa0d81d

Can't find the commit on kvm.git.

> (replace "A (get_dirty_log)" with "mmu_notifier_invalidate_page"):
>
>     During protecting pages for dirty logging, other threads may also try
>     to protect a page in mmu_sync_children() or kvm_mmu_get_page().
>
>     In such a case, because get_dirty_log releases mmu_lock before flushing
>     TLB's, the following race condition can happen:
>
>       A (get_dirty_log)     B (another thread)
>
>       lock(mmu_lock)
>       clear pte.w
>       unlock(mmu_lock)
>                             lock(mmu_lock)
>                             pte.w is already cleared
>                             unlock(mmu_lock)
>                             skip TLB flush

Not sure which tree it is, but in kvm and upstream I see an
unconditional tlb flush here, not skip (in both
kvm_mmu_slot_remove_write_access and kvm_mmu_rmap_write_protect), so
I assume this has been updated in your tree to be conditional.

Also note that kvm_mmu_rmap_write_protect flushes outside of the
mmu_lock (like in the quoted description), so two write_protect_slot
runs in parallel against each other may not be ok. But that may be
enforced by design, if qemu won't ever call that ioctl from two
different userland threads (it doesn't sound security related, so it
should be ok to enforce its safety by userland design).

>                             return
>       ...
>       TLB flush
>
>     Though thread B assumes the page has already been protected when it
>     returns, the remaining TLB entry will break that assumption.
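Just so we're looking at the same code, the racy pattern I assume
your tree has now is something like the sketch below (not actual
kvm.git code: upstream kvm_mmu_slot_remove_write_access returns void
and the flush is unconditional, the int return value is the assumed
modification):

static void write_protect_slot(struct kvm *kvm, int slot)
{
        int flush;

        spin_lock(&kvm->mmu_lock);
        /*
         * Assumed change: returns nonzero only if some spte
         * actually went from writable to readonly.
         */
        flush = kvm_mmu_slot_remove_write_access(kvm, slot);
        spin_unlock(&kvm->mmu_lock);

        /*
         * Window for thread B: it can take mmu_lock here, find
         * pte.w already clear, skip its own flush and return
         * while a stale writable TLB entry still exists.
         */
        if (flush)
                kvm_flush_remote_tlbs(kvm);
}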
Now I get to the question of why not run the TLB flush inside the
mmu_lock, only if the spte was writable :).
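I.e. keeping the conditional flush but doing it before dropping the
lock, like this (same sketch and same assumed return value as above):

        spin_lock(&kvm->mmu_lock);
        flush = kvm_mmu_slot_remove_write_access(kvm, slot);
        /*
         * Flushing before the unlock closes the window: whoever
         * takes mmu_lock afterwards and finds pte.w clear can
         * also rely on the stale TLB entry being gone.
         */
        if (flush)
                kvm_flush_remote_tlbs(kvm);
        spin_unlock(&kvm->mmu_lock);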
kvm_mmu_get_page, as long as it only runs in the context of a kvm
page fault, is ok, because the page fault would be inhibited by the
mmu notifier invalidates, so maybe it's safe.

mmu_sync_children seems to have a problem instead, and in your tree
get_dirty_log also has an issue, if it has been updated to skip the
flush on readonly sptes like I guessed above.

Interesting how the spte is already non present and the page is just
going to be freed shortly later, but we still need to trigger write
faults synchronously and prevent other CPUs in guest mode from
further modifying the page, to avoid losing dirty bit updates or
updates on pagetables that map pagetables in the non NPT/EPT case.
If the page was really only going to be freed, it would be ok if the
other cpus kept writing to it for a little longer, until the page
was actually freed.

Like I wrote in the previous email, I was thinking that if we
changed the mmu notifier methods to do an unconditional flush, then
every other flush could also run outside of the mmu_lock. But I
haven't thought about this enough to be sure.
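Roughly like the sketch below (only the idea, with the srcu locking
and part of the mmu notifier bookkeeping of the real
kvm_mmu_notifier_invalidate_page dropped for brevity):

static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
                                             struct mm_struct *mm,
                                             unsigned long address)
{
        struct kvm *kvm = mmu_notifier_to_kvm(mn);

        spin_lock(&kvm->mmu_lock);
        kvm->mmu_notifier_seq++;
        kvm_unmap_hva(kvm, address);
        /*
         * Flush unconditionally, even if kvm_unmap_hva found no
         * present spte: then a flusher that dropped mmu_lock
         * before flushing can never leave a stale writable TLB
         * entry behind across a page free.
         */
        kvm_flush_remote_tlbs(kvm);
        spin_unlock(&kvm->mmu_lock);
}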
My guess is we could move all flushes outside the mmu_lock if we
stop flushing the tlb conditionally on the current spte values. It'd
clearly be slower for an UP guest though :). Large SMP guests might
benefit, if that is feasible at all... It depends how problematic
the mmu_lock is on the large SMP guests and how much we're saving by
doing conditional TLB flushes.