On Tue, Nov 30, 2021 at 12:48:24AM +0000, David Matlack wrote:
> On Wed, Nov 24, 2021 at 09:44:21PM +0000, Mingwei Zhang wrote:
> > TDP MMU SPTE zapping process currently uses two levels of iterations. The
> > first level iteration happens at the for loop within the zap_gfn_range()
> > with the purpose of calibrating the accurate range for zapping. The second
> > level itreration start at tdp_mmu_set_spte{,_atomic}() that tears down the
> 
>                  iteration
> 
> > whole paging structures (leaf and non-leaf SPTEs) within the range. The
> > former iteration is yield safe, while the second one is not.
> 
> I know what you mean but I'd suggest being more specific than "yield
> safe". For example:
> 
>   Unlike the outer loop, the recursive zapping done under
>   tdp_mmu_set_spte{,_atomic} does not yield. Since zapping is done with
>   a pre-order traversal, zapping sufficiently large ranges can lead to
>   RCU stall warnings.
> 
> I'd also clarify here that the TDP MMU iterator uses a pre-order
> traversal, which causes KVM to end up doing the maximum amount of
> zapping under tdp_mmu_set_spte{,_atomic} and not the outer for loop.

(Ah sorry for the redundant suggestion. I wrote this paragraph and then
reworded my suggested text above to mention the pre-order traversal.)
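
To make the yielding point concrete: the outer walk looks roughly like the
sketch below. This is a simplified, untested illustration rather than the
exact upstream code, although the helper names (tdp_root_for_each_pte,
tdp_mmu_iter_cond_resched, tdp_mmu_set_spte) are the existing ones.
tdp_mmu_iter_cond_resched() can only run between iterator steps, so once a
high-level non-leaf SPTE is handed to tdp_mmu_set_spte(), the entire
subtree beneath it is torn down with no opportunity to reschedule.

        tdp_root_for_each_pte(iter, root, start, end) {
                /* The only point where the walk may yield. */
                if (can_yield &&
                    tdp_mmu_iter_cond_resched(kvm, &iter, flush, shared)) {
                        flush = false;
                        continue;
                }

                if (!is_shadow_present_pte(iter.old_spte))
                        continue;

                /*
                 * Zapping a non-leaf SPTE here recursively tears down every
                 * page table and mapping below it, without yielding.
                 */
                tdp_mmu_set_spte(kvm, &iter, 0);
                flush = true;
        }
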
> 
> >
> > In many cases, zapping SPTE process could be optimized since the non-leaf
> > SPTEs could most likely be retained for the next allocation. On the other
> > hand, for large scale SPTE zapping scenarios, we may end up zapping too
> > many SPTEs and use excessive CPU time that causes the RCU stall warning.
> >
> > The following selftest reproduces the warning:
> >
> > (env: kvm.tdp_mmu=Y)
> > ./dirty_log_perf_test -v 64 -b 8G
> >
> > Optimize the zapping process by skipping all SPTEs above a certain level in
> > the first iteration. This allows us to control the granularity of the
> > actual zapping and invoke tdp_mmu_iter_cond_resched() on time. In addition,
> > we would retain some of the non-leaf SPTEs to accelerate next allocation.
> >
> > For the selection of the `certain level`, we choose the PG_LEVEL_1G because
> > it is currently the largest page size supported and it naturally fits the
> > scenario of splitting large pages.
> >
> > For `zap_all` case (usually) at VM teardown time, we use a two-phase
> > mechanism: the 1st phase zaps all SPTEs at PG_LEVEL_1G level and 2nd phase
> > zaps everything else. This is achieved by the helper function
> > __zap_gfn_range().
> >
> > Cc: Sean Christopherson <seanjc@xxxxxxxxxx>
> > Cc: Ben Gardon <bgardon@xxxxxxxxxx>
> > Cc: David Matlack <dmatlack@xxxxxxxxxx>
> >
> > Signed-off-by: Mingwei Zhang <mizhang@xxxxxxxxxx>
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 57 ++++++++++++++++++++++++++------------
> >  1 file changed, 40 insertions(+), 17 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 89d16bb104de..3fadc51c004a 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -697,24 +697,16 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
> >   * account for the possibility that other threads are modifying the paging
> >   * structures concurrently. If shared is false, this thread should hold the
> >   * MMU lock in write mode.
> > + *
> > + * If zap_all is true, eliminate all the paging structures that contains the
> > + * SPTEs.
> >   */
> > -static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> > -                          gfn_t start, gfn_t end, bool can_yield, bool flush,
> > -                          bool shared)
> > +static bool __zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> > +                            gfn_t start, gfn_t end, bool can_yield, bool flush,
> > +                            bool shared, bool zap_all)
> >  {
> > -        gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
> > -        bool zap_all = (start == 0 && end >= max_gfn_host);
> >          struct tdp_iter iter;
> >
> > -        /*
> > -         * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
> > -         * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
> > -         * and so KVM will never install a SPTE for such addresses.
> > -         */
> > -        end = min(end, max_gfn_host);
> > -
> > -        kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> > -
> >          rcu_read_lock();
> >
> >          tdp_root_for_each_pte(iter, root, start, end) {
> > @@ -725,17 +717,24 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> >                          continue;
> >                  }
> >
> > -                if (!is_shadow_present_pte(iter.old_spte))
> > +                /*
> > +                 * In zap_all case, ignore the checking of present since we have
> > +                 * to zap everything.
> > +                 */
> > +                if (!zap_all && !is_shadow_present_pte(iter.old_spte))
> >                          continue;
> 
> I don't believe there's any reason to attempt to zap a non-present spte,
> even in the zap_all case. In any case, this change deserves its own
> patch and a commit message that describes why the old logic is incorrect
> and how this fixes it.
> 
> >
> >                  /*
> >                   * If this is a non-last-level SPTE that covers a larger range
> >                   * than should be zapped, continue, and zap the mappings at a
> > -                 * lower level, except when zapping all SPTEs.
> > +                 * lower level. Actual zapping started at proper granularity
> > +                 * that is not so large as to cause a soft lockup when handling
> > +                 * the changed pte (which does not yield).
> >                   */
> >                  if (!zap_all &&
> >                      (iter.gfn < start ||
> > -                     iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
> > +                     iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end ||
> > +                     iter.level > PG_LEVEL_1G) &&
> >                      !is_last_spte(iter.old_spte, iter.level))
> >                          continue;
> 
> This if statement is getting a bit long. I'd suggest breaking out the
> level check and also using KVM_MAX_HUGEPAGE_LEVEL.
> 
> e.g.
> 
>         /*
>          * If not doing zap_all, only zap up to the huge page level to
>          * avoid doing too much work in the recursive tdp_mmu_set_spte*
>          * call below, since it does not yield.
>          *
>          * This will potentially leave behind some childless page tables
>          * but that's ok because ...
>          */
>         if (!zap_all && iter.level > KVM_MAX_HUGEPAGE_LEVEL)
>                 continue;
> 
> And on that note, what is the reasoning for why it's ok to leave behind
> childless page tables? I assume it's because most of the time we'll use
> that page table again in the future, and at worst we leave the page
> table allocated until the VM is cleaned up?
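
In case it helps, with the level check broken out like that, the existing
range check can go back to exactly its pre-patch form, which also keeps the
diff smaller. Untested sketch of how I picture the two checks sitting
together in the loop body:

        /*
         * If not doing zap_all, only zap up to the huge page level to
         * avoid doing too much work in the recursive tdp_mmu_set_spte*
         * call below, since it does not yield.
         */
        if (!zap_all && iter.level > KVM_MAX_HUGEPAGE_LEVEL)
                continue;

        /*
         * If this is a non-last-level SPTE that covers a larger range
         * than should be zapped, continue, and zap the mappings at a
         * lower level.
         */
        if (!zap_all &&
            (iter.gfn < start ||
             iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
            !is_last_spte(iter.old_spte, iter.level))
                continue;
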
> 
> >
> > @@ -756,6 +755,30 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> >          return flush;
> >  }
> >
> > +static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> > +                          gfn_t start, gfn_t end, bool can_yield, bool flush,
> > +                          bool shared)
> > +{
> > +        gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
> > +        bool zap_all = (start == 0 && end >= max_gfn_host);
> > +
> > +        /*
> > +         * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
> > +         * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
> > +         * and so KVM will never install a SPTE for such addresses.
> > +         */
> > +        end = min(end, max_gfn_host);
> > +
> > +        kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> > +
> > +        flush = __zap_gfn_range(kvm, root, start, end, can_yield, flush, shared,
> > +                                false);
> > +        if (zap_all)
> > +                flush = __zap_gfn_range(kvm, root, start, end, can_yield, flush,
> > +                                        shared, true);
> > +        return flush;
> > +}
> > +
> >  /*
> >   * Tears down the mappings for the range of gfns, [start, end), and frees the
> >   * non-root pages mapping GFNs strictly within that range. Returns true if
> > --
> > 2.34.0.rc2.393.gf8c9666880-goog
> >
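
One last note on the zap_all path, mostly to check my understanding of the
two-phase scheme. Annotating the new call sequence with my reading of the
intent (not something I've verified):

        /*
         * Phase 1 (zap_all == false): zap everything at or below the 1G
         * level, yielding via tdp_mmu_iter_cond_resched() between SPTEs,
         * so no single non-yielding teardown covers more than 1GiB.
         */
        flush = __zap_gfn_range(kvm, root, start, end, can_yield, flush,
                                shared, false);

        /*
         * Phase 2 (zap_all == true): zap whatever is left. Only the
         * upper-level page tables emptied by phase 1 remain, so each
         * individual non-yielding teardown is now cheap.
         */
        if (zap_all)
                flush = __zap_gfn_range(kvm, root, start, end, can_yield,
                                        flush, shared, true);

Is that an accurate summary? If so, it would be worth spelling it out in
the comment above __zap_gfn_range().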