Re: [PATCH] KVM: mmu: spte_write_protect optimization

+Ben and David

On Wed, May 25, 2022, Venkatesh Srinivas wrote:
> From: Junaid Shahid <junaids@xxxxxxxxxx>
> 
> This change uses a lighter-weight function instead of mmu_spte_update()
> in the common case in spte_write_protect(). This helps speed up the
> get_dirty_log IOCTL.

When upstreaming our internal patches, please rewrite the changelogs to meet
upstream standards.  If that requires a lot of effort, then by all means take
credit with a Co-developed-by, or even make yourself the sole author with a
Suggested-by for the original author.

There is a _lot_ of subtle stuff going on in this patch, and practically none
of it is explained.

> Performance: dirty_log_perf_test with 32 GB VM size
>              Avg IOCTL time over 10 passes
>              Haswell: ~0.23s vs ~0.4s
>              IvyBridge: ~0.8s vs 1s
> 
> Signed-off-by: Venkatesh Srinivas <venkateshs@xxxxxxxxxxxx>
> Signed-off-by: Junaid Shahid <junaids@xxxxxxxxxx>
> ---
>  arch/x86/kvm/mmu/mmu.c | 26 +++++++++++++++++++++-----
>  1 file changed, 21 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index efe5a3dca1e0..a6db9dfaf7c3 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1151,6 +1151,22 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
>  	}
>  }
>  
> +static bool spte_test_and_clear_writable(u64 *sptep)
> +{
> +	u64 spte = *sptep;
> +
> +	if (spte & PT_WRITABLE_MASK) {

This is redundant; the caller has already verified that the spte is writable.

> +		clear_bit(PT_WRITABLE_SHIFT, (unsigned long *)sptep);

Is an unconditional atomic op faster than checking spte_has_volatile_bits()?  If
so, that should be documented with a comment.

This also needs a comment explaining that mmu_lock is held for write, and so
WRITABLE can't be cleared out from under us by a different task, i.e. using
clear_bit() instead of test_and_clear_bit() is acceptable.
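
Something along these lines, maybe (just a sketch of the wording; the perf
claim is an assumption that would need real numbers behind it):

	/*
	 * Clear WRITABLE with an atomic op so that Accessed/Dirty updates
	 * made by hardware aren't clobbered (assumed to be cheaper than
	 * checking spte_has_volatile_bits() first).  A bare clear_bit()
	 * instead of test_and_clear_bit() is safe because mmu_lock is held
	 * for write, i.e. WRITABLE can't be cleared by a different task
	 * between the caller's check and here.
	 */
	clear_bit(PT_WRITABLE_SHIFT, (unsigned long *)sptep);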

> +		if (!spte_ad_enabled(spte))

This absolutely needs a comment explaining what's going on.  Specifically that if
A/D bits are disabled, the WRITABLE bit doubles as the dirty flag, whereas hardware's
DIRTY bit is untouched when A/D bits are in use.
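
Again, a rough sketch of the sort of comment I'm after (wording is just a
suggestion):

	/*
	 * If A/D bits are disabled for this SPTE, the WRITABLE bit doubles
	 * as the dirty flag, so mark the pfn dirty now that WRITABLE has
	 * been cleared.  If A/D bits are in use, hardware's DIRTY bit is
	 * deliberately left untouched.
	 */
	if (!spte_ad_enabled(spte))
		kvm_set_pfn_dirty(spte_to_pfn(spte));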

> +			kvm_set_pfn_dirty(spte_to_pfn(spte));
> +
> +		return true;
> +	}
> +
> +	return false;
> +}
> +
>  /*
>   * Write-protect on the specified @sptep, @pt_protect indicates whether
>   * spte write-protection is caused by protecting shadow page table.
> @@ -1174,11 +1190,11 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect)
>  
>  	rmap_printk("spte %p %llx\n", sptep, *sptep);
>  
> -	if (pt_protect)
> -		spte &= ~shadow_mmu_writable_mask;
> -	spte = spte & ~PT_WRITABLE_MASK;
> -
> -	return mmu_spte_update(sptep, spte);
> +	if (pt_protect) {
> +		spte &= ~(shadow_mmu_writable_mask | PT_WRITABLE_MASK);
> +		return mmu_spte_update(sptep, spte);
> +	}
> +	return spte_test_and_clear_writable(sptep);

I'd rather open code the logic instead of adding a helper.  Without even more
comments, it's not at all obvious when it's safe to use spte_test_and_clear_writable().
And although this patch raises the question of whether we should do a similar
thing for the TDP MMU's clear_dirty_gfn_range(), the TDP MMU and legacy MMU can't
really share a helper at this time because the TDP MMU only holds mmu_lock for read.

So minus all the comments that need to be added, I think this can just be:

---
 arch/x86/kvm/mmu/mmu.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index efe5a3dca1e0..caf5db7f1dce 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1174,11 +1174,16 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect)

 	rmap_printk("spte %p %llx\n", sptep, *sptep);

-	if (pt_protect)
-		spte &= ~shadow_mmu_writable_mask;
-	spte = spte & ~PT_WRITABLE_MASK;
+	if (pt_protect) {
+		spte &= ~(shadow_mmu_writable_mask | PT_WRITABLE_MASK);
+		return mmu_spte_update(sptep, spte);
+	}

-	return mmu_spte_update(sptep, spte);
+	clear_bit(PT_WRITABLE_SHIFT, (unsigned long *)sptep);
+	if (!spte_ad_enabled(spte))
+		kvm_set_pfn_dirty(spte_to_pfn(spte));
+
+	return true;
 }

 static bool rmap_write_protect(struct kvm_rmap_head *rmap_head,

base-commit: 9fd329c0c3309a87821817ca553b449936e318a9
--


And then on the TDP MMU side, I _think_ we can do the following (again, minus
comments).  The logic being that if something else races and clears the bit or
zaps the SPTEs, there's no point in "retrying" because the !PRESENT check and/or
the WRITABLE/DIRTY bit check will fail.

The only hiccup would be clearing a REMOVED_SPTE bit, but by sheer dumb luck, past
me chose a semi-arbitrary value for REMOVED_SPTE that doesn't use either dirty bit.

This would need to be done over two patches, e.g. to make the EPT definitions
available to the mmu.
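
E.g. the prep patch would be whatever is needed to let the TDP MMU see the
real VMX_EPT_DIRTY_BIT definition from arch/x86/include/asm/vmx.h instead of
the local #define hack below (sketch only; the exact header plumbing is TBD):

	/* arch/x86/kvm/mmu/tdp_mmu.c, or somewhere in spte.h */
	#include <asm/vmx.h>	/* for VMX_EPT_DIRTY_BIT (bit 9) */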

Ben, David, any holes in my idea?

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 841feaa48be5..c28a0b9e2af1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1600,6 +1600,8 @@ void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
        }
 }

+#define VMX_EPT_DIRTY_BIT BIT_ULL(9)
+
 /*
  * Clear the dirty status of all the SPTEs mapping GFNs in the memslot. If
  * AD bits are enabled, this will involve clearing the dirty bit on each SPTE.
@@ -1610,35 +1612,32 @@ void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
 static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
                           gfn_t start, gfn_t end)
 {
-       struct tdp_iter iter;
-       u64 new_spte;
+       const int dirty_shift = ilog2(shadow_dirty_mask);
        bool spte_set = false;
+       struct tdp_iter iter;
+       int bit_nr;
+
+       BUILD_BUG_ON(REMOVED_SPTE & PT_DIRTY_MASK);
+       BUILD_BUG_ON(REMOVED_SPTE & VMX_EPT_DIRTY_BIT);

        rcu_read_lock();

        tdp_root_for_each_leaf_pte(iter, root, start, end) {
-retry:
                if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
                        continue;

                if (!is_shadow_present_pte(iter.old_spte))
                        continue;

-               if (spte_ad_need_write_protect(iter.old_spte)) {
-                       if (is_writable_pte(iter.old_spte))
-                               new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
-                       else
-                               continue;
-               } else {
-                       if (iter.old_spte & shadow_dirty_mask)
-                               new_spte = iter.old_spte & ~shadow_dirty_mask;
-                       else
-                               continue;
-               }
+               if (spte_ad_need_write_protect(iter.old_spte))
+                       bit_nr = PT_WRITABLE_SHIFT;
+               else
+                       bit_nr = dirty_shift;

-               if (tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
-                       goto retry;
+               if (!test_and_clear_bit(bit_nr, (unsigned long *)iter.sptep))
+                       continue;

+               kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
                spte_set = true;
        }



