Re: [PATCH] KVM: mmu: spte_write_protect optimization

On Wed, May 25, 2022 at 08:07:05PM +0000, Sean Christopherson wrote:
> +Ben and David
> 
> On Wed, May 25, 2022, Venkatesh Srinivas wrote:
> > From: Junaid Shahid <junaids@xxxxxxxxxx>
> > 
> > This change uses a lighter-weight function instead of mmu_spte_update()
> > in the common case in spte_write_protect(). This helps speed up the
> > get_dirty_log IOCTL.
> 
> When upstreaming our internal patches, please rewrite the changelogs to meet
> upstream standards.  If that requires a lot of effort, then by all means take
> credit with a Co-developed-by, or even make yourself the sole author with a
> Suggested-by for the original author.
> 
> There is a _lot_ going on in this patch, much of it subtle, and practically
> none of it is explained.
> 
> > Performance: dirty_log_perf_test with 32 GB VM size
> >              Avg IOCTL time over 10 passes
> >              Haswell: ~0.23s vs ~0.4s
> >              IvyBridge: ~0.8s vs 1s
> > 
> > Signed-off-by: Venkatesh Srinivas <venkateshs@xxxxxxxxxxxx>
> > Signed-off-by: Junaid Shahid <junaids@xxxxxxxxxx>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 26 +++++++++++++++++++++-----
> >  1 file changed, 21 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index efe5a3dca1e0..a6db9dfaf7c3 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1151,6 +1151,22 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
> >  	}
> >  }
> >  
> > +static bool spte_test_and_clear_writable(u64 *sptep)
> > +{
> > +	u64 spte = *sptep;
> > +
> > +	if (spte & PT_WRITABLE_MASK) {
> 
> This is redundant, the caller has already verified that the spte is writable.
> 
> > +		clear_bit(PT_WRITABLE_SHIFT, (unsigned long *)sptep);
> 
> Is an unconditional atomic op faster than checking spte_has_volatile_bits()?  If
> so, that should be documented with a comment.
> 
> This also needs a comment explaining the mmu_lock is held for write and so
> WRITABLE can't be cleared, i.e. using clear_bit() instead of test_and_clear_bit()
> is acceptable.
> 
> > +		if (!spte_ad_enabled(spte))
> 
> This absolutely needs a comment explaining what's going on.  Specifically that if
> A/D bits are disabled, the WRITABLE bit doubles as the dirty flag, whereas hardware's
> DIRTY bit is untouched when A/D bits are in use.
> 
> > +			kvm_set_pfn_dirty(spte_to_pfn(spte));
> > +
> > +		return true;
> > +	}
> > +
> > +	return false;
> > +}
> > +
> >  /*
> >   * Write-protect on the specified @sptep, @pt_protect indicates whether
> >   * spte write-protection is caused by protecting shadow page table.
> > @@ -1174,11 +1190,11 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect)
> >  
> >  	rmap_printk("spte %p %llx\n", sptep, *sptep);
> >  
> > -	if (pt_protect)
> > -		spte &= ~shadow_mmu_writable_mask;
> > -	spte = spte & ~PT_WRITABLE_MASK;
> > -
> > -	return mmu_spte_update(sptep, spte);
> > +	if (pt_protect) {
> > +		spte &= ~(shadow_mmu_writable_mask | PT_WRITABLE_MASK);
> > +		return mmu_spte_update(sptep, spte);
> > +	}
> > +	return spte_test_and_clear_writable(sptep);
> 
> I would rather open code the logic instead of adding a helper.  Without even more
> comments, it's not at all obvious when it's safe to use spte_test_and_clear_writable().
> And although this patch raises the question of whether we should do a similar
> thing for the TDP MMU's clear_dirty_gfn_range(), the TDP MMU and legacy MMU can't
> really share a helper at this time because the TDP MMU only holds mmu_lock for read.
> 
> So minus all the comments that need to be added, I think this can just be:
> 
> ---
>  arch/x86/kvm/mmu/mmu.c | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index efe5a3dca1e0..caf5db7f1dce 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1174,11 +1174,16 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect)
> 
>  	rmap_printk("spte %p %llx\n", sptep, *sptep);
> 
> -	if (pt_protect)
> -		spte &= ~shadow_mmu_writable_mask;
> -	spte = spte & ~PT_WRITABLE_MASK;
> +	if (pt_protect) {
> +		spte &= ~(shadow_mmu_writable_mask | PT_WRITABLE_MASK);
> +		return mmu_spte_update(sptep, spte);
> +	}
> 
> -	return mmu_spte_update(sptep, spte);
> +	clear_bit(PT_WRITABLE_SHIFT, (unsigned long *)sptep);
> +	if (!spte_ad_enabled(spte))
> +		kvm_set_pfn_dirty(spte_to_pfn(spte));
> +
> +	return true;
>  }
> 
>  static bool rmap_write_protect(struct kvm_rmap_head *rmap_head,
> 
> base-commit: 9fd329c0c3309a87821817ca553b449936e318a9
> --
> 
> 
> And then on the TDP MMU side, I _think_ we can do the following (again, minus
> comments).  The logic being that if something else races and clears the bit or
> zaps the SPTEs, there's no point in "retrying" because either the !PRESENT check
> and/or the WRITABLE/DIRTY bit check will fail.
> 
> The only hiccup would be clearing a REMOVED_SPTE bit, but by sheer dumb luck, past
> me chose a semi-arbitrary value for REMOVED_SPTE that doesn't use either dirty bit.
> 
> This would need to be done over two patches, e.g. to make the EPT definitions
> available to the mmu.
> 
> Ben, David, any holes in my idea?

I think the idea will work. But I also wonder if we should just change
clear_dirty_gfn_range() to run under the write lock [1]. Then we could
share the code with the shadow MMU and avoid the REMOVED_SPTE
complexity.

[1] As a general rule, it seems like large (e.g. slot-wide)
single-threaded batch operations should run under the write lock.
They will be more efficient by avoiding atomic instructions, and can
still avoid lock contention via tdp_mmu_iter_cond_resched(). e.g. Ben
and I have been talking about changing eager page splitting to always
run under the write lock.
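
For what it's worth, a rough stand-alone sketch of the trade-off [1] is
getting at (generic C, not KVM code; the bit position below is just a
stand-in for PT_WRITABLE_MASK):

  #include <stdbool.h>
  #include <stdint.h>

  /* Stand-in for PT_WRITABLE_MASK (x86 R/W is bit 1). */
  #define EXAMPLE_WRITABLE_BIT (1ULL << 1)

  /*
   * With the lock held for write, nothing else can modify the entry, so
   * a plain read-modify-write is enough -- no LOCK-prefixed instructions.
   */
  static bool clear_writable_exclusive(uint64_t *sptep)
  {
          uint64_t old = *sptep;

          if (!(old & EXAMPLE_WRITABLE_BIT))
                  return false;
          *sptep = old & ~EXAMPLE_WRITABLE_BIT;
          return true;
  }

  /*
   * With the lock held for read, other CPUs may update the entry
   * concurrently, so the clear must be a single atomic operation.
   */
  static bool clear_writable_shared(uint64_t *sptep)
  {
          uint64_t old = __atomic_fetch_and(sptep, ~EXAMPLE_WRITABLE_BIT,
                                            __ATOMIC_ACQ_REL);

          return old & EXAMPLE_WRITABLE_BIT;
  }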

> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 841feaa48be5..c28a0b9e2af1 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1600,6 +1600,8 @@ void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
>         }
>  }
> 
> +#define VMX_EPT_DIRTY_BIT BIT_ULL(9)
> +
>  /*
>   * Clear the dirty status of all the SPTEs mapping GFNs in the memslot. If
>   * AD bits are enabled, this will involve clearing the dirty bit on each SPTE.
> @@ -1610,35 +1612,32 @@ void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
>  static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>                            gfn_t start, gfn_t end)
>  {
> -       struct tdp_iter iter;
> -       u64 new_spte;
> +       const int dirty_shift = ilog2(shadow_dirty_mask);
>         bool spte_set = false;
> +       struct tdp_iter iter;
> +       int bit_nr;
> +
> +       BUILD_BUG_ON(REMOVED_SPTE & PT_DIRTY_MASK);
> +       BUILD_BUG_ON(REMOVED_SPTE & VMX_EPT_DIRTY_BIT);
> 
>         rcu_read_lock();
> 
>         tdp_root_for_each_leaf_pte(iter, root, start, end) {
> -retry:
>                 if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
>                         continue;
> 
>                 if (!is_shadow_present_pte(iter.old_spte))
>                         continue;
> 
> -               if (spte_ad_need_write_protect(iter.old_spte)) {
> -                       if (is_writable_pte(iter.old_spte))
> -                               new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
> -                       else
> -                               continue;
> -               } else {
> -                       if (iter.old_spte & shadow_dirty_mask)
> -                               new_spte = iter.old_spte & ~shadow_dirty_mask;
> -                       else
> -                               continue;
> -               }
> +               if (spte_ad_need_write_protect(iter.old_spte))
> +                       bit_nr = PT_WRITABLE_SHIFT;
> +               else
> +                       bit_nr = dirty_shift;
> 
> -               if (tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
> -                       goto retry;
> +               if (!test_and_clear_bit(bit_nr, (unsigned long *)iter.sptep))
> +                       continue;
> 
> +               kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
>                 spte_set = true;
>         }
> 
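
For completeness, the BUILD_BUG_ON check above is roughly the kernel
equivalent of the following stand-alone sketch (hypothetical values, not
the real KVM/EPT definitions; the real code would keep using BUILD_BUG_ON):

  #include <assert.h>
  #include <stdint.h>

  /* Hypothetical sentinel and bit positions, for illustration only. */
  #define EXAMPLE_REMOVED_SPTE   0x5a0ULL
  #define EXAMPLE_PT_DIRTY_BIT   (1ULL << 6)   /* x86 paging dirty bit */
  #define EXAMPLE_EPT_DIRTY_BIT  (1ULL << 9)   /* EPT dirty bit */

  /*
   * If the sentinel value used one of the bits cleared with
   * test_and_clear_bit(), clearing that bit on a removed SPTE would
   * corrupt the sentinel, so reject any overlap at compile time.
   */
  static_assert(!(EXAMPLE_REMOVED_SPTE & EXAMPLE_PT_DIRTY_BIT),
                "sentinel must not use the paging dirty bit");
  static_assert(!(EXAMPLE_REMOVED_SPTE & EXAMPLE_EPT_DIRTY_BIT),
                "sentinel must not use the EPT dirty bit");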


