On Wed, 2013-08-07 at 15:33 +0530, Bharat Bhushan wrote:
> When the MM code is invalidating a range of pages, it calls the KVM
> kvm_mmu_notifier_invalidate_range_start() notifier function, which calls
> kvm_unmap_hva_range(), which arranges to flush all the TLBs for guest pages.
> However, the Linux PTEs for the range being flushed are still valid at
> that point. We are not supposed to establish any new references to pages
> in the range until the ...range_end() notifier gets called.
> The PPC-specific KVM code doesn't get any explicit notification of that;
> instead, we are supposed to use mmu_notifier_retry() to test whether we
> are or have been inside a range flush notifier pair while we have been
> referencing a page.
>
> This patch calls mmu_notifier_retry() while mapping the guest page to
> ensure we do not reference a page while a range invalidation is in
> progress.
>
> This call is inside a region locked with kvm->mmu_lock, which is the
> same lock that is taken by the KVM MMU notifier functions, thus
> ensuring that no new notification can proceed while we are in the
> locked region.
>
> Signed-off-by: Bharat Bhushan <bharat.bhushan@xxxxxxxxxxxxx>
> ---
>  arch/powerpc/kvm/e500_mmu_host.c |   19 +++++++++++++++++--
>  1 files changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
> index ff6dd66..ae4eaf6 100644
> --- a/arch/powerpc/kvm/e500_mmu_host.c
> +++ b/arch/powerpc/kvm/e500_mmu_host.c
> @@ -329,8 +329,14 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
>  	int tsize = BOOK3E_PAGESZ_4K;
>  	unsigned long tsize_pages = 0;
>  	pte_t *ptep;
> -	int wimg = 0;
> +	int wimg = 0, ret = 0;
>  	pgd_t *pgdir;
> +	unsigned long mmu_seq;
> +	struct kvm *kvm = vcpu_e500->vcpu.kvm;
> +
> +	/* used to check for invalidations in progress */
> +	mmu_seq = kvm->mmu_notifier_seq;
> +	smp_rmb();
>
>  	/*
>  	 * Translate guest physical to true physical, acquiring
> @@ -458,6 +464,13 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
>  			(long)gfn, pfn);
>  		return -EINVAL;
>  	}
> +
> +	spin_lock(&kvm->mmu_lock);
> +	if (mmu_notifier_retry(kvm, mmu_seq)) {
> +		ret = -EAGAIN;
> +		goto out;
> +	}
> +
>  	kvmppc_e500_ref_setup(ref, gtlbe, pfn, wimg);
>
>  	kvmppc_e500_setup_stlbe(&vcpu_e500->vcpu, gtlbe, tsize,
> @@ -466,10 +479,12 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
>  	/* Clear i-cache for new pages */
>  	kvmppc_mmu_flush_icache(pfn);
>
> +out:
> +	spin_unlock(&kvm->mmu_lock);
>  	/* Drop refcount on page, so that mmu notifiers can clear it */
>  	kvm_release_pfn_clean(pfn);
>
> -	return 0;
> +	return ret;
>  }

Acked-by: Scott Wood <scottwood@xxxxxxxxxxxxx>

...since it's currently the standard KVM approach, though I'm not happy
about the busy-waiting aspect of it. What if we have preempted the thread
responsible for decrementing mmu_notifier_count, especially if we are a
SCHED_FIFO task of higher priority than that thread?

-Scott
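For reference, below is a minimal, self-contained user-space sketch of the
mmu_notifier_retry() handshake the patch relies on, not code from the thread.
All names here (struct mmu_model, range_start(), range_end(), notifier_retry(),
shadow_map()) are hypothetical stand-ins for kvm->mmu_notifier_seq,
kvm->mmu_notifier_count, the invalidate_range_start()/...range_end() callbacks,
mmu_notifier_retry() and kvmppc_e500_shadow_map(); a pthread mutex stands in
for kvm->mmu_lock and also provides the ordering that smp_rmb() supplies in the
kernel. It models the logic only, not the real KVM data structures.

/* cc -pthread retry_model.c */
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct mmu_model {
    pthread_mutex_t lock;         /* stands in for kvm->mmu_lock */
    unsigned long notifier_seq;   /* stands in for kvm->mmu_notifier_seq */
    unsigned long notifier_count; /* stands in for kvm->mmu_notifier_count */
};

/* ...range_start(): a flush is beginning, so mappers must back off */
static void range_start(struct mmu_model *m)
{
    pthread_mutex_lock(&m->lock);
    m->notifier_count++;
    pthread_mutex_unlock(&m->lock);
}

/* ...range_end(): bump the sequence before dropping the count, so a mapper
 * that raced with the whole start/end pair still notices that it happened */
static void range_end(struct mmu_model *m)
{
    pthread_mutex_lock(&m->lock);
    m->notifier_seq++;
    m->notifier_count--;
    pthread_mutex_unlock(&m->lock);
}

/* Model of mmu_notifier_retry(); caller must hold m->lock */
static bool notifier_retry(struct mmu_model *m, unsigned long seq)
{
    return m->notifier_count != 0 || m->notifier_seq != seq;
}

/* Shape of the mapping path after the patch */
static int shadow_map(struct mmu_model *m)
{
    unsigned long seq = m->notifier_seq; /* sampled first; smp_rmb() here in the kernel */
    int ret = 0;

    /* ... translate the gfn, walk the Linux page tables, pick WIMG bits ... */

    pthread_mutex_lock(&m->lock);
    if (notifier_retry(m, seq)) {
        ret = -EAGAIN;  /* caller retries; no stale translation is installed */
        goto out;
    }
    /* safe point: the shadow TLB entry would be written here */
out:
    pthread_mutex_unlock(&m->lock);
    return ret;
}

int main(void)
{
    struct mmu_model m = { .lock = PTHREAD_MUTEX_INITIALIZER };

    printf("no flush:     %d\n", shadow_map(&m));   /* 0 */
    range_start(&m);
    printf("during flush: %d\n", shadow_map(&m));   /* -EAGAIN */
    range_end(&m);
    printf("after flush:  %d\n", shadow_map(&m));   /* 0 */
    return 0;
}

The ordering is the point: the mapper samples the sequence number before
touching the Linux PTE, and only commits the shadow TLB entry under the lock
if no flush is in progress (count is zero) and none has completed since the
sample (sequence unchanged). Otherwise it returns -EAGAIN and the caller maps
again, which is the busy-waiting aspect Scott is unhappy about.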