On Tue, Feb 14, 2012 at 06:10:44PM +0100, Andrea Arcangeli wrote:
> On Fri, Feb 10, 2012 at 03:28:31PM +0900, Takuya Yoshikawa wrote:
> > Other threads may process the same page in that small window and skip
> > TLB flush and then return before these functions do flush.
>
> It's correct to flush the shadow MMU TLB without the mmu_lock only in
> the context of mmu notifier methods. So the below, while it won't hurt
> correctness, is a performance regression and shouldn't be applied (and
> it obfuscates the code by not being strict anymore).
>
> To the contrary, every other place that does a shadow/secondary MMU smp
> tlb flush _must_ do it inside the mmu_lock, otherwise the serialization
> isn't correct anymore against the mmu_lock taken by
> kvm_mmu_notifier_invalidate_page/kvm_mmu_notifier_invalidate_range_start
> in the below quoted patch.
>
> The explanation is in commit 4539b35881ae9664b0e2953438dd83f5ee02c0b4.
>
> I'll try to explain it more clearly: the moment you drop mmu_lock,
> pages can be freed. So if you invalidate a spte in any place inside
> the KVM code (except the mmu notifier methods, where a reference on
> the page is implicitly held by the caller, so the page can't go away
> under a mmu notifier method by design), then the below
> kvm_mmu_notifier_invalidate_page/kvm_mmu_notifier_invalidate_range_start
> won't get their need_tlb_flush set anymore, and they won't run the tlb
> flush before freeing the page.
>
> So every other place (except mmu notifier) must flush the secondary
> MMU smp tlb _before_ releasing the mmu_lock.
>
> Only mmu notifier is safe to flush the secondary MMU TLB _after_
> releasing the mmu_lock.
>
> If we changed the mmu notifier methods to unconditionally flush the
> shadow TLB (regardless of whether a spte was present or not), we might
> not need to hold the mmu_lock for every tlb flush outside the context
> of the mmu notifier methods. But then the mmu notifier methods would
> become more expensive; I didn't fully evaluate what the side effects
> of such a change would be. Also note, only
> kvm_mmu_notifier_invalidate_page and
> kvm_mmu_notifier_invalidate_range_start would need to do that, because
> they're the only two where the page reference gets dropped.
>
> Even shorter: because with the mmu notifier an implicit reference on
> the page exists and is held by the caller, the notifier methods can
> flush outside the mmu_lock. Every other place in KVM holds an implicit
> valid reference on the page only as long as it holds the mmu_lock, or
> while a spte is still established.
>
> Well, it's not easy logic, so it's not surprising it wasn't totally
> clear.
>
> It's probably not heavily documented, and the fact you changed it is
> still good, as we get to refresh our minds on the exact rules of mmu
> notifier locking, thanks!

The problem the patch is fixing is not related to page freeing, but to
rmap_write_protect. From commit 8bf3f6f06fcdfd097b6c6ec51531d8292fa0d81d
(replace "A (get_dirty_log)" with "mmu_notifier_invalidate_page"):

    During protecting pages for dirty logging, other threads may also
    try to protect a page in mmu_sync_children() or kvm_mmu_get_page().

    In such a case, because get_dirty_log releases mmu_lock before
    flushing TLB's, the following race condition can happen:

              A (get_dirty_log)     B (another thread)

      lock(mmu_lock)
      clear pte.w
      unlock(mmu_lock)
                            lock(mmu_lock)
                            pte.w is already cleared
                            unlock(mmu_lock)
                            skip TLB flush
                            return
      ...
      TLB flush

    Though thread B assumes the page has already been protected when it
    returns, the remaining TLB entry will break that assumption.
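To make the window concrete, here is a minimal userspace model of that
race (all names are hypothetical stand-ins, not kernel code:
pte_writable models the spte's W bit, tlb_dirty models "a stale
writable TLB entry still exists", flush_tlb() models
kvm_flush_remote_tlbs(); the flags are plain bools for simplicity).
Thread A clears the write bit under the lock but flushes only after
dropping it, so thread B can observe the cleared bit, skip its own
flush, and return while the stale entry still exists:

/* race_model.c -- build with: cc -pthread race_model.c */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t mmu_lock = PTHREAD_MUTEX_INITIALIZER;
static bool pte_writable = true;	/* ~ the spte's W bit */
static bool tlb_dirty;			/* ~ stale writable TLB entry */

static void flush_tlb(void) { tlb_dirty = false; }

/* Buggy ordering: clear pte.w under the lock, flush after dropping it. */
static void *thread_a(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&mmu_lock);
	pte_writable = false;	/* clear pte.w */
	tlb_dirty = true;	/* TLB still caches the writable mapping */
	pthread_mutex_unlock(&mmu_lock);
	/* <-- window: B can run entirely inside this gap */
	flush_tlb();
	return NULL;
}

/* B models mmu_sync_children()/kvm_mmu_get_page(): it flushes only if
 * it was the one to clear pte.w, and assumes the page is write
 * protected when it returns. */
static void *thread_b(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&mmu_lock);
	bool need_flush = pte_writable;
	pte_writable = false;
	pthread_mutex_unlock(&mmu_lock);
	if (need_flush)
		flush_tlb();
	else if (tlb_dirty)	/* the broken assumption, made visible */
		printf("B returned with a stale writable TLB entry\n");
	return NULL;
}

int main(void)
{
	pthread_t a, b;
	pthread_create(&a, NULL, thread_a, NULL);
	pthread_create(&b, NULL, thread_b, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}

The message fires only on interleavings where B runs inside A's
unlock-to-flush window; that window is exactly what the patch closes.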
>
> Andrea
>
> > Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@xxxxxxxxxxxxx>
> > ---
> >  virt/kvm/kvm_main.c |   19 ++++++++++---------
> >  1 files changed, 10 insertions(+), 9 deletions(-)
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 470e305..2b4bc77 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -289,15 +289,15 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
> >  	 */
> >  	idx = srcu_read_lock(&kvm->srcu);
> >  	spin_lock(&kvm->mmu_lock);
> > +
> >  	kvm->mmu_notifier_seq++;
> >  	need_tlb_flush = kvm_unmap_hva(kvm, address) | kvm->tlbs_dirty;
> > -	spin_unlock(&kvm->mmu_lock);
> > -	srcu_read_unlock(&kvm->srcu, idx);
> > -
> >  	/* we've to flush the tlb before the pages can be freed */
> >  	if (need_tlb_flush)
> >  		kvm_flush_remote_tlbs(kvm);
> >
> > +	spin_unlock(&kvm->mmu_lock);
> > +	srcu_read_unlock(&kvm->srcu, idx);
> >  }
> >
> >  static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> > @@ -335,12 +335,12 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >  	for (; start < end; start += PAGE_SIZE)
> >  		need_tlb_flush |= kvm_unmap_hva(kvm, start);
> >  	need_tlb_flush |= kvm->tlbs_dirty;
> > -	spin_unlock(&kvm->mmu_lock);
> > -	srcu_read_unlock(&kvm->srcu, idx);
> > -
> >  	/* we've to flush the tlb before the pages can be freed */
> >  	if (need_tlb_flush)
> >  		kvm_flush_remote_tlbs(kvm);
> > +
> > +	spin_unlock(&kvm->mmu_lock);
> > +	srcu_read_unlock(&kvm->srcu, idx);
> >  }
> >
> >  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > @@ -378,13 +378,14 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
> >
> >  	idx = srcu_read_lock(&kvm->srcu);
> >  	spin_lock(&kvm->mmu_lock);
> > -	young = kvm_age_hva(kvm, address);
> > -	spin_unlock(&kvm->mmu_lock);
> > -	srcu_read_unlock(&kvm->srcu, idx);
> >
> > +	young = kvm_age_hva(kvm, address);
> >  	if (young)
> >  		kvm_flush_remote_tlbs(kvm);
> >
> > +	spin_unlock(&kvm->mmu_lock);
> > +	srcu_read_unlock(&kvm->srcu, idx);
> > +
> >  	return young;
> >  }
> >
> > --
> > 1.7.5.4
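In terms of the model above, the quoted patch amounts to moving the
flush inside the critical section: any thread that sees pte.w already
cleared is then guaranteed the flush has completed too. A sketch of the
corrected thread A, reusing the same hypothetical names:

/* Fixed ordering, mirroring the quoted patch: flush the TLB before
 * releasing mmu_lock, so no other thread can observe the cleared
 * pte.w while a stale writable TLB entry still exists. */
static void *thread_a_fixed(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&mmu_lock);
	pte_writable = false;
	tlb_dirty = true;
	flush_tlb();		/* flush while still holding the lock */
	pthread_mutex_unlock(&mmu_lock);
	return NULL;
}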