Re: [PATCH] KVM: Defer remote tlb flushes on invlpg (v4)

Hi Andrea,

On Sat, Apr 11, 2009 at 06:48:54PM +0200, Andrea Arcangeli wrote:
> On Sun, Mar 29, 2009 at 01:36:01PM +0300, Avi Kivity wrote:
> > Marcelo, Andrea?
> 
> Had to read the code a bit more to understand the reason for the
> unsync_mmu flush on cr3 overwrite.
> 
> > Avi Kivity wrote:
> >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >> index 2a36f7f..f0ea56c 100644
> >> --- a/arch/x86/kvm/mmu.c
> >> +++ b/arch/x86/kvm/mmu.c
> >> @@ -1184,8 +1184,7 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
> >>  		for_each_sp(pages, sp, parents, i)
> >>  			protected |= rmap_write_protect(vcpu->kvm, sp->gfn);
> >>
> >> -		if (protected)
> >> -			kvm_flush_remote_tlbs(vcpu->kvm);
> >> +		kvm_flush_remote_tlbs_cond(vcpu->kvm, protected);
> >>
> >>  		for_each_sp(pages, sp, parents, i) {
> >>  			kvm_sync_page(vcpu, sp);
> 
> Ok, so because we didn't flush the tlb on the other vcpus when invlpg
> ran, if a cr3 overwrite needs to re-sync sptes by wrprotecting them,
> we have to flush the tlb on all vcpus, to be sure that any possibly
> writable tlb entry reflecting the old writable spte (instantiated
> before invlpg ran) is removed from the physical cpus. We wouldn't
> find that spte in for_each_sp because it was rmap_removed, but we'll
> find something in mmu_unsync_walk (right? we definitely have to find
> something in mmu_unsync_walk for this to work: the parent sp has to
> leave child->unsync set even after rmap_remove ran in invlpg without
> flushing the other vcpus' tlbs).

mmu_sync_children needs to find any _reachable_ sptes that are unsync,
read the guest pagetable, and sync the sptes. Before it can inspect the
guest pagetable, it needs to write protect it, with rmap_write_protect.

In theory it wouldn't be necessary to call
kvm_flush_remote_tlbs_cond(protected) here; a plain
kvm_flush_remote_tlbs() guarded by "protected" would do, since the
"kvm->remote_tlbs_dirty" information is not pertinent to
mmu_sync_children.

But it is done here to "clear" remote_tlbs_dirty (after a
kvm_flush_remote_tlbs, remote_tlbs_dirty is clean), i.e. as an
optimization.
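
For reference, the control flow in question looks roughly like this
(a heavily simplified sketch of mmu_sync_children; the pages/parents
setup and the resched logic are omitted):

	while (mmu_unsync_walk(parent, &pages)) {
		int protected = 0;

		/* write protect the guest pagetables before reading them */
		for_each_sp(pages, sp, parents, i)
			protected |= rmap_write_protect(vcpu->kvm, sp->gfn);

		/*
		 * Flush if we write protected something, and take the
		 * chance to also flush (and clear) any pending
		 * remote_tlbs_dirty left behind by invlpg.
		 */
		kvm_flush_remote_tlbs_cond(vcpu->kvm, protected);

		/* now the guest pagetables are stable and can be synced */
		for_each_sp(pages, sp, parents, i) {
			kvm_sync_page(vcpu, sp);
			mmu_pages_clear_parents(&parents);
		}
	}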

> >> @@ -465,7 +464,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
> >>  				rmap_remove(vcpu->kvm, sptep);
> >>  				if (is_large_pte(*sptep))
> >>  					--vcpu->kvm->stat.lpages;
> >> -				need_flush = 1;
> >> +				vcpu->kvm->remote_tlbs_dirty = true;
> >>  			}
> >>  			set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
> >>  			break;
> >> @@ -475,8 +474,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
> >>  			break;
> >>  	}
> >>
> >> -	if (need_flush)
> >> -		kvm_flush_remote_tlbs(vcpu->kvm);
> >>  	spin_unlock(&vcpu->kvm->mmu_lock);
> 
> AFAIK, to be compliant with lowlevel archs (without ASN it doesn't
> matter, I think, as vmx always flushes on exit), we have to flush the
> local tlb here, with set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests). I
> don't see why it's missing. Or am I wrong?

Caller does it:

void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
{
        vcpu->arch.mmu.invlpg(vcpu, gva);
        kvm_mmu_flush_tlb(vcpu);
        ++vcpu->stat.invlpg;
}

> >> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >> index 68b217e..12afa50 100644
> >> --- a/virt/kvm/kvm_main.c
> >> +++ b/virt/kvm/kvm_main.c
> >> @@ -758,10 +758,18 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
> >>  void kvm_flush_remote_tlbs(struct kvm *kvm)
> >>  {
> >> +	kvm->remote_tlbs_dirty = false;
> >> +	smp_wmb();
> 
> Still no lock prefix on the asm insn, and here it runs outside the
> mmu_lock, but ok, I tend to agree smp_wmb should be enough to be sure
> the write is fully finished by the time smp_wmb returns. There's
> another problem though.
> 
> CPU0				CPU1
> -----------			-------------
> remote_tlbs_dirty = false
> 				remote_tlbs_dirty = true
> smp_tlb_flush
> 				set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
> 
> 
> The flush for the sptep will be lost.

What about protecting remote_tlbs_dirty with mmu_lock? The only caller
of kvm_flush_remote_tlbs that lacks mmu_lock is kvm_mmu_zap_all, which
is not performance sensitive anyway.
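
With every writer of the flag holding mmu_lock, the clear and the
flush can no longer interleave with the dirty/set_shadow_pte pair, so
the race above goes away. Untested sketch:

	/* invlpg side, already under mmu_lock */
	vcpu->kvm->remote_tlbs_dirty = true;
	set_shadow_pte(sptep, shadow_trap_nonpresent_pte);

	/* flush side, now also under mmu_lock */
	kvm->remote_tlbs_dirty = false;
	if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
		++kvm->stat.remote_tlb_flush;

kvm_mmu_zap_all would simply do its flush inside the mmu_lock
critical section.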

> >> @@ -907,8 +913,7 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
> >>  	young = kvm_age_hva(kvm, address);
> >>  	spin_unlock(&kvm->mmu_lock);
> >>
> >> -	if (young)
> >> -		kvm_flush_remote_tlbs(kvm);
> >> +	kvm_flush_remote_tlbs_cond(kvm, young);
> >>
> >>  	return young;
> >>  }
> 
> No need to flush for the clear_flush_young method: pages can't be
> freed there.
> 
> I reworked the patch a bit, plus fixed the above smp race; let me
> know what you think.
> 
> The best workload to exercise this is running a VM with lots of VCPUs
> and 8G of ram with a 32bit guest kernel, and then just malloc and
> touch a byte in each 4096-byte page allocated by malloc. That will
> run a flood of invlpg. Then push the system to swap. "while :; do cp
> /dev/hda /dev/null; done" also works without O_DIRECT, so the host
> cache makes it fast on the second run (not so much faster with host
> swapping though).
> 
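
FWIW the invlpg flood half of that workload is just something like
this (userspace sketch, size picked arbitrarily):

	#include <stdlib.h>

	int main(void)
	{
		size_t size = 1UL << 30;
		char *p = malloc(size);
		size_t i;

		if (!p)
			return 1;
		/* touch one byte per 4096-byte page */
		for (i = 0; i < size; i += 4096)
			p[i] = 1;
		return 0;
	}
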
> I only tested it so far with 12 VMs on swap with 64bit kernels and
> heavy I/O, so it's not a good test as I doubt any invlpg runs; not
> even munmap(addr, 4k) uses invlpg.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> ---
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index d5bdf3a..900bc31 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1185,7 +1185,11 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
>  		for_each_sp(pages, sp, parents, i)
>  			protected |= rmap_write_protect(vcpu->kvm, sp->gfn);
>  
> -		if (protected)
> +		/*
> +		 * Avoid leaving stale tlb entries in this vcpu representing
> +		 * sptes rmap_removed by invlpg without vcpu smp tlb flush.
> +		 */
> +		if (protected || vcpu->kvm->remote_tlbs_dirty)
>  			kvm_flush_remote_tlbs(vcpu->kvm);

+void kvm_flush_remote_tlbs_cond(struct kvm *kvm, bool cond)
+{
+	if (cond || kvm->remote_tlbs_dirty)
+		kvm_flush_remote_tlbs(kvm);
+}

Aren't they the same?

>  
>  		for_each_sp(pages, sp, parents, i) {
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 09782a9..060b4a3 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -483,7 +483,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
>  	}
>  
>  	if (need_flush)
> -		kvm_flush_remote_tlbs(vcpu->kvm);
> +		kvm_flush_local_tlb(vcpu);
>  	spin_unlock(&vcpu->kvm->mmu_lock);

No need, caller does it for us.

>  	if (pte_gpa == -1)
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 095ebb6..731b846 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -125,6 +125,7 @@ struct kvm_kernel_irq_routing_entry {
>  struct kvm {
>  	struct mutex lock; /* protects the vcpus array and APIC accesses */
>  	spinlock_t mmu_lock;
> +	bool remote_tlbs_dirty; /* =true serialized by mmu_lock, =false OOO */

OOO? Out Of Options?

> +void kvm_flush_local_tlb(struct kvm_vcpu *vcpu)
> +{
> +	/*
> +	 * This will guarantee that our current sptep is flushed from
> +	 * the tlb too, even if there's a kvm_flush_remote_tlbs
> +	 * running from under us (i.e. if it's not running under
> +	 * mmu_lock, like in the mmu notifier invocation).
> +	 */
> +	smp_wmb();
> +	/*
> +	 * If the below assignment gets lost because of the lack of atomic
> +	 * ops and one or more concurrent writers in kvm_flush_remote_tlbs,
> +	 * we know that any set_shadow_pte preceding kvm_flush_local_tlb
> +	 * was visible to the other CPU, before KVM_REQ_TLB_FLUSH was set
> +	 * in each vcpu->requests by kvm_flush_remote_tlbs.
> +	 */
> +	vcpu->kvm->remote_tlbs_dirty = true;
> +
> +#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> +	/* get new asid before returning to guest mode */
> +	if (!test_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests))
> +		set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests);
> +#else
> +	/*
> +	 * Without CONFIG_MMU_NOTIFIER enabled this isn't enough but
> +	 * it will reduce the risk window at least.
> +	 */
> +	kvm_flush_remote_tlbs(vcpu->kvm);
> +#endif
> +}
> +
>  void kvm_flush_remote_tlbs(struct kvm *kvm)
>  {
> +	kvm->remote_tlbs_dirty = false;
> +	/*
> +	 * Guarantee that remote_tlbs_dirty is committed to memory
> +	 * before setting KVM_REQ_TLB_FLUSH.
> +	 */
> +	smp_wmb();
>  	if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
>  		++kvm->stat.remote_tlb_flush;
>  }

If remote_tlbs_dirty is protected by mmu_lock, most of this complexity
is unnecessary?
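
i.e. with mmu_lock covering the flag (and ignoring the
!CONFIG_MMU_NOTIFIER fallback), the local side could arguably shrink
to something like this untested sketch:

	void kvm_flush_local_tlb(struct kvm_vcpu *vcpu)
	{
		/*
		 * Caller holds mmu_lock, so this store cannot race with
		 * the "remote_tlbs_dirty = false" in kvm_flush_remote_tlbs.
		 */
		vcpu->kvm->remote_tlbs_dirty = true;
		/* get a new asid / full flush before guest reentry */
		set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests);
	}

with no smp_wmb games needed.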


> @@ -810,6 +847,23 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
>  	return container_of(mn, struct kvm, mmu_notifier);
>  }
>  
> +static void kvm_mmu_notifier_tlb_flush(int need_tlb_flush, struct kvm *kvm)
> +{
> +	/*
> +	 * We have to flush the tlb before the pages can be freed.
> +	 *
> +	 * The important "remote_tlbs_dirty = true" that we can't miss
> +	 * always runs under mmu_lock before we take it in the above
> +	 * spin_lock. Other than this, we have to be sure not to set
> +	 * remote_tlbs_dirty to false inside kvm_flush_remote_tlbs
> +	 * unless we're going to flush the guest smp tlb for any
> +	 * relevant spte that has been invalidated just before setting
> +	 * "remote_tlbs_dirty = true".
> +	 */
> +	if (need_tlb_flush || kvm->remote_tlbs_dirty)
> +		kvm_flush_remote_tlbs(kvm);
> +}
> +
>  static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
>  					     struct mm_struct *mm,
>  					     unsigned long address)
> @@ -840,10 +894,7 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
>  	need_tlb_flush = kvm_unmap_hva(kvm, address);
>  	spin_unlock(&kvm->mmu_lock);
>  
> -	/* we've to flush the tlb before the pages can be freed */
> -	if (need_tlb_flush)
> -		kvm_flush_remote_tlbs(kvm);
> -
> +	kvm_mmu_notifier_tlb_flush(need_tlb_flush, kvm);
>  }
>  
>  static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> @@ -865,9 +916,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  		need_tlb_flush |= kvm_unmap_hva(kvm, start);
>  	spin_unlock(&kvm->mmu_lock);
>  
> -	/* we've to flush the tlb before the pages can be freed */
> -	if (need_tlb_flush)
> -		kvm_flush_remote_tlbs(kvm);
> +	kvm_mmu_notifier_tlb_flush(need_tlb_flush, kvm);
>  }
>  
>  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> @@ -888,7 +937,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
>  	 * The above sequence increase must be visible before the
>  	 * below count decrease but both values are read by the kvm
>  	 * page fault under mmu_lock spinlock so we don't need to add
> -	 * a smb_wmb() here in between the two.
> +	 * a smp_wmb() here in between the two.
>  	 */
>  	kvm->mmu_notifier_count--;
>  	spin_unlock(&kvm->mmu_lock);