On Wed, Apr 18, 2012 at 11:53:51AM +0800, Xiao Guangrong wrote:
> On 04/18/2012 09:47 AM, Marcelo Tosatti wrote:
> 
> > On Fri, Apr 13, 2012 at 06:16:33PM +0800, Xiao Guangrong wrote:
> >> If the present bit of the page fault error code is set, it indicates
> >> that the shadow page table is populated on all levels, so all we need
> >> to do is modify the access bit, which can be done out of mmu-lock
> >>
> >> Currently, in order to simplify the code, we only fix the page fault
> >> caused by write-protect on the fast path
> >>
> >> Signed-off-by: Xiao Guangrong <xiaoguangrong@xxxxxxxxxxxxxxxxxx>
> >> ---
> >>  arch/x86/kvm/mmu.c         |  205 ++++++++++++++++++++++++++++++++++++++++----
> >>  arch/x86/kvm/paging_tmpl.h |    3 +
> >>  2 files changed, 192 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >> index efa5d59..fc91667 100644
> >> --- a/arch/x86/kvm/mmu.c
> >> +++ b/arch/x86/kvm/mmu.c
> >> @@ -446,6 +446,13 @@ static bool __check_direct_spte_mmio_pf(u64 spte)
> >>  }
> >>  #endif
> >>
> >> +static bool spte_wp_by_dirty_log(u64 spte)
> >> +{
> >> +	WARN_ON(is_writable_pte(spte));
> >> +
> >> +	return (spte & SPTE_ALLOW_WRITE) && !(spte & SPTE_WRITE_PROTECT);
> >> +}
> >> +
> >>  static bool spte_has_volatile_bits(u64 spte)
> >>  {
> >>  	if (!shadow_accessed_mask)
> >> @@ -454,9 +461,18 @@ static bool spte_has_volatile_bits(u64 spte)
> >>  	if (!is_shadow_present_pte(spte))
> >>  		return false;
> >>
> >> -	if ((spte & shadow_accessed_mask) &&
> >> -	      (!is_writable_pte(spte) || (spte & shadow_dirty_mask)))
> >> -		return false;
> >> +	if (spte & shadow_accessed_mask) {
> >> +		if (is_writable_pte(spte))
> >> +			return !(spte & shadow_dirty_mask);
> >> +
> >> +		/*
> >> +		 * If the spte is write-protected by dirty-log, it may
> >> +		 * be marked writable on fast page fault path, then CPU
> >> +		 * can modify the Dirty bit.
> >> +		 */
> >> +		if (!spte_wp_by_dirty_log(spte))
> >> +			return false;
> >> +	}
> >>
> >>  	return true;
> >>  }
> >> @@ -1109,26 +1125,18 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
> >>  		rmap_remove(kvm, sptep);
> >>  }
> >>
> >> -static bool spte_wp_by_dirty_log(u64 spte)
> >> -{
> >> -	WARN_ON(is_writable_pte(spte));
> >> -
> >> -	return (spte & SPTE_ALLOW_WRITE) && !(spte & SPTE_WRITE_PROTECT);
> >> -}
> >> -
> >>  static void spte_write_protect(struct kvm *kvm, u64 *sptep, bool large,
> >>  			       bool *flush, bool page_table_protect)
> >>  {
> >>  	u64 spte = *sptep;
> >>
> >>  	if (is_writable_pte(spte)) {
> >> -		*flush |= true;
> >> -
> >>  		if (large) {
> >>  			pgprintk("rmap_write_protect(large): spte %p %llx\n",
> >>  				 spte, *spte);
> >>  			BUG_ON(!is_large_pte(spte));
> >>
> >> +			*flush |= true;
> >>  			drop_spte(kvm, sptep);
> >>  			--kvm->stat.lpages;
> >>  			return;
> >> @@ -1137,6 +1145,9 @@ static void spte_write_protect(struct kvm *kvm, u64 *sptep, bool large,
> >>  			goto reset_spte;
> >>  	}
> >>
> >> +	/* We need flush tlbs in this case: the fast page fault path
> >> +	 * can mark the spte writable after we read the sptep.
> >> +	 */
> >>  	if (page_table_protect && spte_wp_by_dirty_log(spte))
> >>  		goto reset_spte;
> >>
> >> @@ -1144,6 +1155,8 @@ static void spte_write_protect(struct kvm *kvm, u64 *sptep, bool large,
> >>
> >>  reset_spte:
> >>  	rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
> >> +
> >> +	*flush |= true;
> >>  	spte = spte & ~PT_WRITABLE_MASK;
> >>  	if (page_table_protect)
> >>  		spte |= SPTE_WRITE_PROTECT;
> >> @@ -2767,18 +2780,172 @@ exit:
> >>  	return ret;
> >>  }
> >>
> >> +static bool page_fault_can_be_fast(struct kvm_vcpu *vcpu, gfn_t gfn,
> >> +				   u32 error_code)
> >> +{
> >> +	unsigned long *rmap;
> >> +
> >> +	/*
> >> +	 * #PF can be fast only if the shadow page table is present and it
> >> +	 * is caused by write-protect, that means we just need change the
> >> +	 * W bit of the spte which can be done out of mmu-lock.
> >> +	 */
> >> +	if (!(error_code & PFERR_PRESENT_MASK) ||
> >> +	      !(error_code & PFERR_WRITE_MASK))
> >> +		return false;
> >> +
> >> +	rmap = gfn_to_rmap(vcpu->kvm, gfn, PT_PAGE_TABLE_LEVEL);
> >> +
> >> +	/* Quickly check the page can be writable. */
> >> +	if (test_bit(PTE_LIST_WP_BIT, ACCESS_ONCE(rmap)))
> >> +		return false;
> >> +
> >> +	return true;
> >> +}
> >> +
> >> +static bool
> >> +fast_pf_fix_indirect_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> >> +			  u64 *sptep, u64 spte, gfn_t gfn)
> >> +{
> >> +	pfn_t pfn;
> >> +	bool ret = false;
> >> +
> >> +	/*
> >> +	 * For the indirect spte, it is hard to get a stable gfn since
> >> +	 * we just use a cmpxchg to avoid all the races which is not
> >> +	 * enough to avoid the ABA problem: the host can arbitrarily
> >> +	 * change spte and the mapping from gfn to pfn.
> >> +	 *
> >> +	 * What we do is call gfn_to_pfn_atomic to bind the gfn and the
> >> +	 * pfn because after the call:
> >> +	 * - we have held the refcount of pfn that means the pfn can not
> >> +	 *   be freed and be reused for another gfn.
> >> +	 * - the pfn is writable that means it can not be shared by different
> >> +	 *   gfn.
> >> +	 */
> >> +	pfn = gfn_to_pfn_atomic(vcpu->kvm, gfn);
> > 
> > Please document what can happen in parallel whenever you manipulate
> > sptes without mmu_lock held, convincing the reader that this is safe.
> 
> 
> OK, I will document it in locking.txt.
> 
> I am not good at documenting; how about the below description?
> 
> What we use to avoid all the races are spte.SPTE_ALLOW_WRITE and spte.SPTE_WRITE_PROTECT:
> SPTE_ALLOW_WRITE means the gfn is writable on both guest and host, and SPTE_WRITE_PROTECT
> means the gfn is write-protected for shadow page write protection.
> 
> On the fast page fault path, we atomically set the spte.W bit if
> spte.SPTE_ALLOW_WRITE = 1 and spte.SPTE_WRITE_PROTECT = 0; this is safe because any
> change to these bits will be detected by the cmpxchg.
> 
> But we need to carefully check the mapping from gfn to pfn, since the cmpxchg can only
> ensure the spte (and therefore the pfn it holds) is unchanged, not that the gfn still
> maps to that pfn. This is an ABA problem; for example, the case below can happen:
> 
> At the beginning:
> 	gpte = gfn1
> 	gfn1 is mapped to pfn1 on host
> 	spte is the shadow page table entry corresponding with gpte and
> 	spte = pfn1
> 
>    VCPU 0                            host
> on page fault path:
> 
>    old_spte = *spte;
>                                   pfn1 is swapped out:
>                                      spte = 0;
> 
>                                   pfn1 is re-alloced for gfn2
> 
>                                   gpte is changed by host, and now
>                                   pointing to gfn2:
>                                      spte = pfn1;
> 
>    if (cmpxchg(spte, old_spte, old_spte+W))
>       mark_page_dirty(vcpu->kvm, gfn1)
>          OOPS!!!
>          we dirty-log for gfn1, that means gfn2 is lost in dirty-bitmap.
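[Editorial note, for illustration only: the description above can be condensed into a
short sketch of the lockless fixup it documents. This is not the patch code; the
function name fast_pf_try_fix_spte is invented here, and the sketch only collapses the
steps so the race window is easy to see. The bit names and helpers it uses
(SPTE_ALLOW_WRITE, SPTE_WRITE_PROTECT, PT_WRITABLE_MASK, cmpxchg64, mark_page_dirty)
are the ones appearing in the patch quoted above.]

	static bool fast_pf_try_fix_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn)
	{
		u64 old_spte = *sptep;

		/* Only sptes write-protected purely for dirty logging qualify. */
		if (!(old_spte & SPTE_ALLOW_WRITE) || (old_spte & SPTE_WRITE_PROTECT))
			return false;

		/*
		 * A plain value change (the spte was zapped, repointed to another
		 * pfn, or SPTE_WRITE_PROTECT was set) is caught here: the cmpxchg
		 * fails and we fall back to the slow path under mmu_lock.
		 */
		if (cmpxchg64(sptep, old_spte, old_spte | PT_WRITABLE_MASK) != old_spte)
			return false;

		/*
		 * The timeline above attacks exactly this point: the spte value can
		 * go pfn1 -> 0 -> pfn1 again, so the cmpxchg succeeds even though the
		 * gfn no longer maps to that pfn, and the wrong gfn would be
		 * dirty-logged below. That is why the indirect case pins the gfn
		 * first, as described next.
		 */
		mark_page_dirty(vcpu->kvm, gfn);
		return true;
	}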
> 
> For direct sp, we can easily avoid it since the spte of a direct sp is fixed to the gfn.
> For indirect sp, before we do the cmpxchg, we call gfn_to_pfn_atomic to pin the gfn to
> the pfn, because after gfn_to_pfn_atomic:
>    - we have held the refcount of the pfn, which means the pfn can not
>      be freed and reused for another gfn.
>    - the pfn is writable, which means it can not be shared by different
>      gfns by KSM.
> Then, we can ensure the dirty bitmap is correctly set for a gfn.

This is one possible scenario, OK. If you can list all the possibilities, even
better. We need to check every possible case carefully.

> > Same with current users of walk_shadow_page_lockless_begin.
> 
> After walk_shadow_page_lockless_begin, it is safe since reader_counter has
> been increased:
> 
> static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
> {
> 	rcu_read_lock();
> 	atomic_inc(&vcpu->kvm->arch.reader_counter);
> 
> 	/* Increase the counter before walking shadow page table */
> 	smp_mb__after_atomic_inc();
> }
> 
> Is that not enough?

It is, I don't know what I was talking about.
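[Editorial addendum, for illustration only: the pinning argument above, namely that
gfn_to_pfn_atomic takes a reference so the pfn cannot be freed and reused for another
gfn, and that a writable pfn cannot be a KSM page shared by several gfns, boils down to
a check of the shape sketched below. This is not the remainder of
fast_pf_fix_indirect_spte from the patch (the quote is cut off before it); the function
name and structure here are guesses, while gfn_to_pfn_atomic, is_error_pfn, spte_to_pfn
and kvm_release_pfn_clean are existing KVM helpers of that era.]

	static bool fast_pf_check_indirect(struct kvm_vcpu *vcpu, u64 old_spte, gfn_t gfn)
	{
		pfn_t pfn;
		bool ret = false;

		/* Pin the gfn->pfn mapping; this takes a reference on the pfn. */
		pfn = gfn_to_pfn_atomic(vcpu->kvm, gfn);

		/*
		 * If the pinned pfn is still the one recorded in the spte, the gfn
		 * and the spte agree: the reference prevents the pfn from being
		 * freed and handed to another gfn, and a writable pfn is not a
		 * shared KSM page, so dirty-logging this gfn is safe.
		 */
		if (!is_error_pfn(pfn) && pfn == spte_to_pfn(old_spte))
			ret = true;

		kvm_release_pfn_clean(pfn);
		return ret;
	}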