On Sat, Apr 6, 2024 at 7:26 PM maobibo <maobibo@xxxxxxxxxxx> wrote:
>
>
> On 2024/4/5 1:10 AM, Sean Christopherson wrote:
> > On Thu, Apr 04, 2024, David Matlack wrote:
> >> On Tue, Apr 2, 2024 at 6:50 PM maobibo <maobibo@xxxxxxxxxxx> wrote:
> >>>> This change eliminates dips in guest performance during live migration
> >>>> in a 160 vCPU VM when userspace is issuing CLEAR ioctls (tested with
> >>>> 1GiB and 8GiB CLEARs). Userspace could issue finer-grained CLEARs, which
> >>> Frequently dropping/reacquiring mmu_lock will cause the userspace
> >>> migration process issuing CLEAR ioctls to contend with the 160 vCPUs,
> >>> so migration may become slower.
> >>
> >> In practice we have not found this to be the case. With this patch
> >> applied we see a significant improvement in guest workload throughput
> >> while userspace is issuing CLEAR ioctls, without any change to the
> >> overall migration duration.
> >
> > ...
> >
> >> In the case of this patch, there doesn't seem to be a trade-off. We
> >> see an improvement to vCPU performance without any regression in
> >> migration duration or other metrics.
> >
> > For x86. We need to keep in mind that not all architectures have x86's
> > optimization around dirty logging faults, or around faults in general.
> > E.g. LoongArch's (which I assume is Bibo Mao's primary interest)
> > kvm_map_page_fast() still acquires mmu_lock. And if the fault can't be
> > handled in the fast path, KVM will actually acquire mmu_lock twice
> > (mmu_lock is dropped after the fast-path, then reacquired after the
> > mmu_seq and fault-in pfn stuff).
> Yes, there is no lock in fast_page_fault() on x86; however, the mmu
> spinlock is held on the fast path on LoongArch. I did not notice this;
> originally I thought there was read_lock on the x86 fast path for pte
> modification and write_lock for page table modification.

Most[*] vCPU faults on KVM/x86 are handled as follows:

 - vCPU write-protection and access-tracking faults are handled
   locklessly (fast_page_fault()).

 - All other vCPU faults are handled under read_lock(&kvm->mmu_lock).

[*] The locking is different when nested virtualization is in use, TDP
(i.e. stage-2 hardware) is disabled, and/or kvm.tdp_mmu=N.
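For illustration, that x86 flow looks roughly like the below. This is a
simplified sketch modeled on KVM/x86's TDP MMU fault path, not the
actual kernel code (memory-cache top-up, mmu_seq checks, and pfn
fault-in are all elided, and the function name here is made up):

  /* Sketch: how a TDP MMU vCPU fault is dispatched on x86. */
  static int tdp_page_fault_sketch(struct kvm_vcpu *vcpu,
                                   struct kvm_page_fault *fault)
  {
          int r;

          /*
           * Write-protection and access-tracking faults are resolved
           * by atomically replacing the SPTE, with mmu_lock not held
           * at all.  RET_PF_INVALID means the fast path couldn't
           * handle the fault.
           */
          r = fast_page_fault(vcpu, fault);
          if (r != RET_PF_INVALID)
                  return r;

          /*
           * Everything else runs with mmu_lock held for read, so vCPUs
           * don't serialize against each other, only against writers
           * (e.g. CLEAR_DIRTY_LOG, which takes mmu_lock for write).
           */
          read_lock(&vcpu->kvm->mmu_lock);
          r = kvm_tdp_mmu_map(vcpu, fault);
          read_unlock(&vcpu->kvm->mmu_lock);

          return r;
  }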
> >
> > So for x86, I think we can comfortably state that this change is a net
> > positive for all scenarios. But for other architectures, that might not
> > be the case. I'm not saying this isn't a good change for other
> > architectures, just that we don't have relevant data to really know for
> > sure.
> From the code, there is contention between the migration assistant
> thread and vCPU threads in the slow path, where huge pages need to be
> split if the memslot has dirty logging enabled; however, there is no
> contention between the migration assistant thread and the vCPU fault
> fast path. If that is right, the negative effect is small on x86.

Right, there is no contention between CLEAR_DIRTY_LOG and vCPU
write-protection faults on KVM/x86. KVM/arm64 handles write-protection
faults under the read-lock.

KVM/x86 and KVM/arm64 also both have eager page splitting, where huge
pages are split during CLEAR_DIRTY_LOG, so there are no vCPU faults to
split huge pages either.

> >
> > Absent performance data for other architectures, which is likely going
> > to be difficult/slow to get, it might make sense to have this be opt-in
> > to start. We could even do it with minimal #ifdeffery, e.g. something
> > like the below would allow x86 to do whatever locking it wants in
> > kvm_arch_mmu_enable_log_dirty_pt_masked() (I assume we want to give
> > kvm_get_dirty_log_protect() similar treatment?).
> >
> > I don't love the idea of adding more arch-specific MMU behavior (going
> > the wrong direction), but it doesn't seem like an unreasonable approach
> > in this case.
> No special modification for this, it is a little strange. The LoongArch
> page fault fast path can be improved later.

Sorry, I don't follow exactly what you mean here. Are you saying Sean's
patch is not required for LoongArch and, instead, LoongArch should/will
avoid acquiring the mmu_lock when handling write-protection faults?
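P.S. For anyone reading along in the archives: Sean's snippet got
trimmed in the quoting above. The opt-in he's describing would be shaped
roughly like the below -- this is my reconstruction for illustration
only, not his actual diff, and the Kconfig symbol name is made up:

  /* virt/kvm/kvm_main.c */

  /*
   * Hypothetical opt-in: an architecture that selects this handles
   * mmu_lock itself inside kvm_arch_mmu_enable_log_dirty_pt_masked(),
   * and is therefore free to drop and reacquire the lock between
   * batches.  All other architectures keep the current behavior of
   * holding mmu_lock across the whole operation.
   */
  #ifdef CONFIG_KVM_ARCH_DIRTY_LOG_LOCKING        /* made-up symbol */
  #define KVM_DIRTY_LOG_LOCK(kvm)
  #define KVM_DIRTY_LOG_UNLOCK(kvm)
  #else
  #define KVM_DIRTY_LOG_LOCK(kvm)         KVM_MMU_LOCK(kvm)
  #define KVM_DIRTY_LOG_UNLOCK(kvm)       KVM_MMU_UNLOCK(kvm)
  #endif

  static int kvm_clear_dirty_log_protect(struct kvm *kvm,
                                         struct kvm_clear_dirty_log *log)
  {
          ...
          KVM_DIRTY_LOG_LOCK(kvm);
          for (...each chunk of the dirty bitmap...) {
                  ...
                  kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
                                                          offset, mask);
          }
          KVM_DIRTY_LOG_UNLOCK(kvm);
          ...
  }

(The same treatment would presumably apply to kvm_get_dirty_log_protect(),
per Sean's parenthetical.)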