Re: [PATCH v2] KVM: Aggressively drop and reacquire mmu_lock during CLEAR_DIRTY_LOG

On Thu, Apr 4, 2024 at 10:10 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Thu, Apr 04, 2024, David Matlack wrote:
> > On Tue, Apr 2, 2024 at 6:50 PM maobibo <maobibo@xxxxxxxxxxx> wrote:
> > > > This change eliminates dips in guest performance during live migration
> > > > in a 160 vCPU VM when userspace is issuing CLEAR ioctls (tested with
> > > > 1GiB and 8GiB CLEARs). Userspace could issue finer-grained CLEARs, which
> > > Frequently dropping and reacquiring mmu_lock will cause the userspace
> > > migration process issuing CLEAR ioctls to contend with the 160 vCPUs, so
> > > migration speed may become slower.
> >
> > In practice we have not found this to be the case. With this patch
> > applied we see a significant improvement in guest workload throughput
> > while userspace is issuing CLEAR ioctls without any change to the
> > overall migration duration.
>
> ...
>
> > In the case of this patch, there doesn't seem to be a trade-off. We
> > see an improvement to vCPU performance without any regression in
> > migration duration or other metrics.
>
> For x86.  We need to keep in mind that not all architectures have x86's optimization
> around dirty logging faults, or around faults in general. E.g. LoongArch's (which
> I assume is Bibo Mao's primary interest) kvm_map_page_fast() still acquires mmu_lock.
> And if the fault can't be handled in the fast path, KVM will actually acquire
> mmu_lock twice (mmu_lock is dropped after the fast-path, then reacquired after
> the mmu_seq and fault-in pfn stuff).
>
> So for x86, I think we can comfortably state that this change is a net positive
> for all scenarios.  But for other architectures, that might not be the case.
> I'm not saying this isn't a good change for other architectures, just that we
> don't have relevant data to really know for sure.

I do not have data for other architectures, but I may be able to get
data on ARM in the next few weeks. I believe we saw similar benefits
when testing on ARM.

>
> Absent performance data for other architectures, which is likely going to be
> difficult/slow to get, it might make sense to have this be opt-in to start.  We
> could even do it with minimal #ifdeffery, e.g. something like the below would allow
> x86 to do whatever locking it wants in kvm_arch_mmu_enable_log_dirty_pt_masked()
> (I assume we want to give kvm_get_dirty_log_protect() similar treatment?).

I don't see any reason not to give kvm_get_dirty_log_protect() the
same treatment, but it's less important since
kvm_get_dirty_log_protect() does not take the mmu_lock at all when
manual-protect is enabled.
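
For concreteness, an architecture that wants to opt out of the common
locking could define the new macros itself and do its own locking inside
kvm_arch_mmu_enable_log_dirty_pt_masked(). A rough sketch of what the
opt-in could look like (the no-op definitions, and putting them wherever
KVM_HAVE_MMU_RWLOCK is already defined so they're visible to
virt/kvm/kvm_mm.h, are just my illustration, not part of your diff below):

  /* Hypothetical arch override, visible to virt/kvm/kvm_mm.h. */
  #define KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm)	do { } while (0)
  #define KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm)	do { } while (0)

  /*
   * kvm_arch_mmu_enable_log_dirty_pt_masked() then takes and drops
   * mmu_lock internally, as aggressively as it likes.
   */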

>
> I don't love the idea of adding more arch specific MMU behavior (going the wrong
> direction), but it doesn't seem like an unreasonable approach in this case.

I wonder if this is being overly cautious. I would expect even more
benefit on architectures that more aggressively take the mmu_lock on
vCPU threads during faults. The more lock acquisition on vCPU threads,
the more this patch will help reduce vCPU starvation during
CLEAR_DIRTY_LOG.
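
To make that concrete, the effect of the patch is roughly to turn the one
long critical section in kvm_clear_dirty_log_protect() into a short
critical section per 64-page mask. Sketching the shape of the lock-break
(not the literal diff, details elided):

	for (offset = log->first_page, i = offset / BITS_PER_LONG,
	     n = DIV_ROUND_UP(log->num_pages, BITS_PER_LONG); n--;
	     i++, offset += BITS_PER_LONG) {
		unsigned long mask = *dirty_bitmap_buffer++;

		/* ... clear the corresponding bits in the memslot dirty bitmap ... */
		if (mask) {
			flush = true;
			KVM_MMU_LOCK(kvm);
			kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
								offset, mask);
			/* Let pending vCPU faults take mmu_lock between chunks. */
			KVM_MMU_UNLOCK(kvm);
		}
	}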

Hm, perhaps testing with ept=N (which uses the write-lock even for
dirty logging faults) would be a way to increase confidence in the
effect on other architectures?

>
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index 86d267db87bb..5eb1ce83f29d 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -66,9 +66,9 @@ static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
>         if (!memslot || (offset + __fls(mask)) >= memslot->npages)
>                 return;
>
> -       KVM_MMU_LOCK(kvm);
> +       KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm);
>         kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
> -       KVM_MMU_UNLOCK(kvm);
> +       KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm);
>  }
>
>  int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d1fd9cb5d037..74ae844e4ed0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2279,7 +2279,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
>                 dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot);
>                 memset(dirty_bitmap_buffer, 0, n);
>
> -               KVM_MMU_LOCK(kvm);
> +               KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm);
>                 for (i = 0; i < n / sizeof(long); i++) {
>                         unsigned long mask;
>                         gfn_t offset;
> @@ -2295,7 +2295,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
>                         kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
>                                                                 offset, mask);
>                 }
> -               KVM_MMU_UNLOCK(kvm);
> +               KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm);
>         }
>
>         if (flush)
> @@ -2390,7 +2390,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
>         if (copy_from_user(dirty_bitmap_buffer, log->dirty_bitmap, n))
>                 return -EFAULT;
>
> -       KVM_MMU_LOCK(kvm);
> +       KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm);
>         for (offset = log->first_page, i = offset / BITS_PER_LONG,
>                  n = DIV_ROUND_UP(log->num_pages, BITS_PER_LONG); n--;
>              i++, offset += BITS_PER_LONG) {
> @@ -2413,7 +2413,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
>                                                                 offset, mask);
>                 }
>         }
> -       KVM_MMU_UNLOCK(kvm);
> +       KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm);
>
>         if (flush)
>                 kvm_flush_remote_tlbs_memslot(kvm, memslot);
> diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
> index ecefc7ec51af..39d8b809c303 100644
> --- a/virt/kvm/kvm_mm.h
> +++ b/virt/kvm/kvm_mm.h
> @@ -20,6 +20,11 @@
>  #define KVM_MMU_UNLOCK(kvm)            spin_unlock(&(kvm)->mmu_lock)
>  #endif /* KVM_HAVE_MMU_RWLOCK */
>
> +#ifndef KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT
> +#define KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT     KVM_MMU_LOCK
> +#define KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT   KVM_MMU_UNLOCK
> +#endif
> +
>  kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
>                      bool *async, bool write_fault, bool *writable);
>
>




