On Tue, Feb 14, 2012 at 06:10:44PM +0100, Andrea Arcangeli wrote:
> On Fri, Feb 10, 2012 at 03:28:31PM +0900, Takuya Yoshikawa wrote:
> > Other threads may process the same page in that small window and skip
> > TLB flush and then return before these functions do flush.
>
> It's correct to flush the shadow MMU TLB without the mmu_lock only in
> the context of mmu notifier methods. So the below, while it won't hurt
> correctness, is a performance regression and shouldn't be applied (and
> it obfuscates the code by not being strict anymore).
>
> To the contrary, every other place that does a shadow/secondary MMU smp
> tlb flush _must_ do it inside the mmu_lock, otherwise the serialization
> isn't correct anymore against the mmu_lock taken by
> kvm_mmu_notifier_invalidate_page/kvm_mmu_notifier_invalidate_range_start
> in the below quoted patch.
>
> The explanation is in commit 4539b35881ae9664b0e2953438dd83f5ee02c0b4.
>
> I'll try to explain it more clearly: the moment you drop mmu_lock,
> pages can be freed. So if you invalidate a spte in any place inside
> the KVM code (except the mmu notifier methods, where a reference on
> the page is implicitly held by the caller, so the page can't go away
> under a mmu notifier method by design), then the below
> kvm_mmu_notifier_invalidate_page/kvm_mmu_notifier_invalidate_range_start
> won't get their need_tlb_flush set anymore, and they won't run the tlb
> flush before freeing the page.
>
> So every other place (except mmu notifier) must flush the secondary
> MMU smp tlb _before_ releasing the mmu_lock.
>
> Only mmu notifier is safe to flush the secondary MMU TLB _after_
> releasing the mmu_lock.
>
> If we changed the mmu notifier methods to unconditionally flush the
> shadow TLB (regardless of whether a spte was present or not), we might
> not need to hold the mmu_lock for every tlb flush outside the context
> of the mmu notifier methods. But then the mmu notifier methods would
> become more expensive; I didn't fully evaluate what the side effects
> of such a change would be. Also note, only
> kvm_mmu_notifier_invalidate_page and
> kvm_mmu_notifier_invalidate_range_start would need to do that, because
> they're the only two where the page reference gets dropped.
>
> Even shorter: because with the mmu notifier an implicit reference on
> the page exists and is held by the caller, the notifier methods can
> flush outside the mmu_lock. Every other place in KVM holds an implicit
> valid reference on the page only as long as it holds the mmu_lock, or
> while a spte is still established.
>
> Well, it's not easy logic, so it's not surprising it wasn't totally
> clear.
>
> It's probably not heavily documented, and the fact you changed it is
> still good, as we get to refresh our minds on the exact rules of mmu
> notifier locking, thanks!

The problem the patch is fixing is not related to page freeing, but to
rmap_write_protect. From commit 8bf3f6f06fcdfd097b6c6ec51531d8292fa0d81d
(replace "A (get_dirty_log)" with "mmu_notifier_invalidate_page"):

    During protecting pages for dirty logging, other threads may also
    try to protect a page in mmu_sync_children() or kvm_mmu_get_page().

    In such a case, because get_dirty_log releases mmu_lock before
    flushing TLB's, the following race condition can happen:

              A (get_dirty_log)     B (another thread)

      lock(mmu_lock)
      clear pte.w
      unlock(mmu_lock)
                            lock(mmu_lock)
                            pte.w is already cleared
                            unlock(mmu_lock)
                            skip TLB flush
                            return
      ...
      TLB flush

    Though thread B assumes the page has already been protected when it
    returns, the remaining TLB entry will break that assumption.
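To make the window concrete, here is a minimal userspace model of that
race (all names are hypothetical stand-ins, not kernel code:
pte_writable models the spte's W bit, tlb_dirty models "a stale
writable TLB entry still exists", flush_tlb() models
kvm_flush_remote_tlbs(); the flags are plain bools for simplicity).
Thread A clears the write bit under the lock but flushes only after
dropping it, so thread B can observe the cleared bit, skip its own
flush, and return while the stale entry still exists:

/* race_model.c -- build with: cc -pthread race_model.c */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t mmu_lock = PTHREAD_MUTEX_INITIALIZER;
static bool pte_writable = true;	/* ~ the spte's W bit */
static bool tlb_dirty;			/* ~ stale writable TLB entry */

static void flush_tlb(void) { tlb_dirty = false; }

/* Buggy ordering: clear pte.w under the lock, flush after dropping it. */
static void *thread_a(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&mmu_lock);
	pte_writable = false;	/* clear pte.w */
	tlb_dirty = true;	/* TLB still caches the writable mapping */
	pthread_mutex_unlock(&mmu_lock);
	/* <-- window: B can run entirely inside this gap */
	flush_tlb();
	return NULL;
}

/* B models mmu_sync_children()/kvm_mmu_get_page(): it flushes only if
 * it was the one to clear pte.w, and assumes the page is write
 * protected when it returns. */
static void *thread_b(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&mmu_lock);
	bool need_flush = pte_writable;
	pte_writable = false;
	pthread_mutex_unlock(&mmu_lock);
	if (need_flush)
		flush_tlb();
	else if (tlb_dirty)	/* the broken assumption, made visible */
		printf("B returned with a stale writable TLB entry\n");
	return NULL;
}

int main(void)
{
	pthread_t a, b;
	pthread_create(&a, NULL, thread_a, NULL);
	pthread_create(&b, NULL, thread_b, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}

The message fires only on interleavings where B runs inside A's
unlock-to-flush window; that window is exactly what the patch closes.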
>
> Andrea
>
> > Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@xxxxxxxxxxxxx>
> > ---
> >  virt/kvm/kvm_main.c |   19 ++++++++++---------
> >  1 files changed, 10 insertions(+), 9 deletions(-)
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 470e305..2b4bc77 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -289,15 +289,15 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
> >  	 */
> >  	idx = srcu_read_lock(&kvm->srcu);
> >  	spin_lock(&kvm->mmu_lock);
> > +
> >  	kvm->mmu_notifier_seq++;
> >  	need_tlb_flush = kvm_unmap_hva(kvm, address) | kvm->tlbs_dirty;
> > -	spin_unlock(&kvm->mmu_lock);
> > -	srcu_read_unlock(&kvm->srcu, idx);
> > -
> >  	/* we've to flush the tlb before the pages can be freed */
> >  	if (need_tlb_flush)
> >  		kvm_flush_remote_tlbs(kvm);
> >
> > +	spin_unlock(&kvm->mmu_lock);
> > +	srcu_read_unlock(&kvm->srcu, idx);
> >  }
> >
> >  static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> > @@ -335,12 +335,12 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >  	for (; start < end; start += PAGE_SIZE)
> >  		need_tlb_flush |= kvm_unmap_hva(kvm, start);
> >  	need_tlb_flush |= kvm->tlbs_dirty;
> > -	spin_unlock(&kvm->mmu_lock);
> > -	srcu_read_unlock(&kvm->srcu, idx);
> > -
> >  	/* we've to flush the tlb before the pages can be freed */
> >  	if (need_tlb_flush)
> >  		kvm_flush_remote_tlbs(kvm);
> > +
> > +	spin_unlock(&kvm->mmu_lock);
> > +	srcu_read_unlock(&kvm->srcu, idx);
> >  }
> >
> >  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > @@ -378,13 +378,14 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
> >
> >  	idx = srcu_read_lock(&kvm->srcu);
> >  	spin_lock(&kvm->mmu_lock);
> > -	young = kvm_age_hva(kvm, address);
> > -	spin_unlock(&kvm->mmu_lock);
> > -	srcu_read_unlock(&kvm->srcu, idx);
> >
> > +	young = kvm_age_hva(kvm, address);
> >  	if (young)
> >  		kvm_flush_remote_tlbs(kvm);
> >
> > +	spin_unlock(&kvm->mmu_lock);
> > +	srcu_read_unlock(&kvm->srcu, idx);
> > +
> >  	return young;
> >  }
> >
> > --
> > 1.7.5.4
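In terms of the model above, the quoted patch amounts to moving the
flush inside the critical section: any thread that sees pte.w already
cleared is then guaranteed the flush has completed too. A sketch of the
corrected thread A, reusing the same hypothetical names:

/* Fixed ordering, mirroring the quoted patch: flush the TLB before
 * releasing mmu_lock, so no other thread can observe the cleared
 * pte.w while a stale writable TLB entry still exists. */
static void *thread_a_fixed(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&mmu_lock);
	pte_writable = false;
	tlb_dirty = true;
	flush_tlb();		/* flush while still holding the lock */
	pthread_mutex_unlock(&mmu_lock);
	return NULL;
}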