On Thu, Apr 4, 2024 at 10:10 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Thu, Apr 04, 2024, David Matlack wrote:
> > On Tue, Apr 2, 2024 at 6:50 PM maobibo <maobibo@xxxxxxxxxxx> wrote:
> > > > This change eliminates dips in guest performance during live migration
> > > > in a 160 vCPU VM when userspace is issuing CLEAR ioctls (tested with
> > > > 1GiB and 8GiB CLEARs). Userspace could issue finer-grained CLEARs, which
> > > Frequently dropping/reacquiring mmu_lock will cause the userspace
> > > migration process issuing CLEAR ioctls to contend with the 160 vCPUs,
> > > and migration speed may become slower.
> >
> > In practice we have not found this to be the case. With this patch
> > applied we see a significant improvement in guest workload throughput
> > while userspace is issuing CLEAR ioctls without any change to the
> > overall migration duration.
>
> ...
>
> > In the case of this patch, there doesn't seem to be a trade-off. We
> > see an improvement to vCPU performance without any regression in
> > migration duration or other metrics.
>
> For x86. We need to keep in mind that not all architectures have x86's optimization
> around dirty logging faults, or around faults in general. E.g. LoongArch's (which
> I assume is Bibo Mao's primary interest) kvm_map_page_fast() still acquires mmu_lock.
> And if the fault can't be handled in the fast path, KVM will actually acquire
> mmu_lock twice (mmu_lock is dropped after the fast-path, then reacquired after
> the mmu_seq and fault-in pfn stuff).
>
> So for x86, I think we can comfortably state that this change is a net positive
> for all scenarios. But for other architectures, that might not be the case.
> I'm not saying this isn't a good change for other architectures, just that we
> don't have relevant data to really know for sure.

I do not have data for other architectures, but I may be able to get data
on ARM in the next few weeks. I believe we saw similar benefits when
testing on ARM.

>
> Absent performance data for other architectures, which is likely going to be
> difficult/slow to get, it might make sense to have this be opt-in to start. We
> could even do it with minimal #ifdeffery, e.g. something like the below would allow
> x86 to do whatever locking it wants in kvm_arch_mmu_enable_log_dirty_pt_masked()
> (I assume we want to give kvm_get_dirty_log_protect() similar treatment?).

I don't see any reason not to give kvm_get_dirty_log_protect() the same
treatment, but it's less important since kvm_get_dirty_log_protect()
does not take the mmu_lock at all when manual-protect is enabled.

>
> I don't love the idea of adding more arch specific MMU behavior (going the wrong
> direction), but it doesn't seem like an unreasonable approach in this case.

I wonder if this is being overly cautious. I would expect only more
benefit on architectures that more aggressively take the mmu_lock on
vCPU threads during faults. The more lock acquisition there is on vCPU
threads, the more this patch will help reduce vCPU starvation during
CLEAR_DIRTY_LOG.

Hm, perhaps testing with ept=N (which will take the write-lock even for
dirty logging faults) would be a way to increase confidence in the
effect on other architectures?
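
To spell out what's being debated: the patch under discussion essentially
takes and releases mmu_lock around each 64-page mask passed to
kvm_arch_mmu_enable_log_dirty_pt_masked(), rather than holding the lock
across the entire bitmap walk, so vCPU faults queued on mmu_lock can make
progress between masks. A simplified sketch of that shape (the real code
also clears bits in the memslot dirty bitmap and tracks whether a TLB
flush is needed; it is not reproduced here):

        for (offset = log->first_page, i = offset / BITS_PER_LONG,
             n = DIV_ROUND_UP(log->num_pages, BITS_PER_LONG); n--;
             i++, offset += BITS_PER_LONG) {
                /* Bits userspace asked to clear for these 64 pages. */
                unsigned long mask = *dirty_bitmap_buffer++;

                if (!mask)
                        continue;

                /* Lock per mask instead of across the whole walk. */
                KVM_MMU_LOCK(kvm);
                kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
                                                        offset, mask);
                KVM_MMU_UNLOCK(kvm);
        }
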

>
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index 86d267db87bb..5eb1ce83f29d 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -66,9 +66,9 @@ static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
>          if (!memslot || (offset + __fls(mask)) >= memslot->npages)
>                  return;
>
> -        KVM_MMU_LOCK(kvm);
> +        KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm);
>          kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
> -        KVM_MMU_UNLOCK(kvm);
> +        KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm);
>  }
>
>  int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d1fd9cb5d037..74ae844e4ed0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2279,7 +2279,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
>                  dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot);
>                  memset(dirty_bitmap_buffer, 0, n);
>
> -                KVM_MMU_LOCK(kvm);
> +                KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm);
>                  for (i = 0; i < n / sizeof(long); i++) {
>                          unsigned long mask;
>                          gfn_t offset;
> @@ -2295,7 +2295,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
>                          kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
>                                                                  offset, mask);
>                  }
> -                KVM_MMU_UNLOCK(kvm);
> +                KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm);
>          }
>
>          if (flush)
> @@ -2390,7 +2390,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
>          if (copy_from_user(dirty_bitmap_buffer, log->dirty_bitmap, n))
>                  return -EFAULT;
>
> -        KVM_MMU_LOCK(kvm);
> +        KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm);
>          for (offset = log->first_page, i = offset / BITS_PER_LONG,
>                   n = DIV_ROUND_UP(log->num_pages, BITS_PER_LONG); n--;
>               i++, offset += BITS_PER_LONG) {
> @@ -2413,7 +2413,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
>                                                                  offset, mask);
>                  }
>          }
> -        KVM_MMU_UNLOCK(kvm);
> +        KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm);
>
>          if (flush)
>                  kvm_flush_remote_tlbs_memslot(kvm, memslot);
> diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
> index ecefc7ec51af..39d8b809c303 100644
> --- a/virt/kvm/kvm_mm.h
> +++ b/virt/kvm/kvm_mm.h
> @@ -20,6 +20,11 @@
>  #define KVM_MMU_UNLOCK(kvm)             spin_unlock(&(kvm)->mmu_lock)
>  #endif /* KVM_HAVE_MMU_RWLOCK */
>
> +#ifndef KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT
> +#define KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT      KVM_MMU_LOCK
> +#define KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT    KVM_MMU_UNLOCK
> +#endif
> +
>  kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
>                       bool *async, bool write_fault, bool *writable);
>
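
To make the opt-in concrete: with the KVM_MMU_LOCK fallback above, an
architecture could define the new macros away entirely and do its own,
finer-grained locking inside its kvm_arch_mmu_enable_log_dirty_pt_masked()
implementation. A rough sketch of what that might look like, assuming an
arch whose mmu_lock is the rwlock flavor; the header placement and the
write_lock() choice are illustrative, not part of the diff above:

/* In an arch-private header seen before virt/kvm/kvm_mm.h's fallback. */
#define KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm)         do { } while (0)
#define KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm)       do { } while (0)

/*
 * With the generic code no longer holding mmu_lock across the bitmap
 * walk, the arch hook takes the lock per 64-page mask, so vCPU faults
 * blocked on mmu_lock can make progress between masks.
 */
void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
                                             struct kvm_memory_slot *slot,
                                             gfn_t gfn_offset,
                                             unsigned long mask)
{
        write_lock(&kvm->mmu_lock);
        /* ... existing write-protect / dirty-clear logic for this mask ... */
        write_unlock(&kvm->mmu_lock);
}

The point of the macros is only that the generic code stops dictating the
locking policy; the #ifndef fallback keeps today's behavior for every
architecture that doesn't opt in.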