Re: [PATCH v2] KVM: Aggressively drop and reacquire mmu_lock during CLEAR_DIRTY_LOG


On Wed, Apr 03, 2024, maobibo wrote:
> On 2024/4/3 5:36 AM, David Matlack wrote:
> > Aggressively drop and reacquire mmu_lock during CLEAR_DIRTY_LOG to avoid
> > blocking other threads (e.g. vCPUs taking page faults) for too long.
> > 
> > Specifically, change kvm_clear_dirty_log_protect() to acquire/release
> > mmu_lock only when calling kvm_arch_mmu_enable_log_dirty_pt_masked(),
> > rather than around the entire for loop. This ensures that KVM will only
> > hold mmu_lock for the time it takes the architecture-specific code to
> > process up to 64 pages, rather than holding mmu_lock for log->num_pages,
> > which is controllable by userspace. This also avoids holding mmu_lock
> > when processing parts of the dirty_bitmap that are zero (i.e. when there
> > is nothing to clear).
> > 
> > Moving the acquire/release points for mmu_lock should be safe since
> > dirty_bitmap_buffer is already protected by slots_lock, and dirty_bitmap
> > is already accessed with atomic_long_fetch_andnot(). And at least on x86
> > holding mmu_lock doesn't even serialize access to the memslot dirty
> > bitmap, as vCPUs can call mark_page_dirty_in_slot() without holding
> > mmu_lock.
> > 
> > This change eliminates dips in guest performance during live migration
> > in a 160 vCPU VM when userspace is issuing CLEAR ioctls (tested with
> > 1GiB and 8GiB CLEARs). Userspace could issue finer-grained CLEARs, which
> Frequently dropping/reacquiring mmu_lock will cause the userspace migration
> process issuing CLEAR ioctls to contend with the 160 vCPUs, and migration may
> become slower.

Only if vCPUs actually acquire mmu_lock.  E.g. on x86, KVM fixes/handles faults
due to write-protection for dirty logging without acquiring mmu_lock.  So for x86,
taking faults that need to acquire mmu_lock while dirty logging should be fairly
uncommon (and if vCPUs are taking lots of faults, performance is likely going to
be bad no matter what).

> In theory, the priority of the userspace migration thread should be higher
> than that of the vCPU threads.

That's very debatable.  And it's not an apples-to-apples comparison, because
CLEAR_DIRTY_LOG can hold mmu_lock for a very long time, probably orders of
magnitude longer than a vCPU will hold mmu_lock when handling a page fault.

And on x86 and ARM, page faults can be resolved while holding mmu_lock for read.
As a result, holding mmu_lock in CLEAR_DIRTY_LOG (for write) is effectively more
costly than holding it in vCPUs.

> Dropping and reacquiring mmu_lock every 64 pages may be a little too small a
> batch; in general one huge page size would be more natural. However, it
> should be decided by the framework maintainer:)

We could tweak the batching, but my very strong preference would be to do that
only as a last resort, i.e. if and only if some magic batch number provides waaay
better performance in all scenarios.

Maintaining code with arbitrary magic numbers is a pain, e.g. KVM x86's MMU has
arbitrary batching in kvm_zap_obsolete_pages(), and the justification for the
number is extremely handwavy (paraphrasing the changelog):

      Zap at least 10 shadow pages before releasing mmu_lock to reduce the
      overhead associated with re-acquiring the lock.
      Note: "10" is an arbitrary number, speculated to be high enough so
      that a vCPU isn't stuck zapping obsolete pages for an extended period,
      but small enough so that other vCPUs aren't starved waiting for mmu_lock.

I.e. we're stuck with someone's best guess from years ago without any real idea
if 10 is a good number, let alone optimal.

Obviously that doesn't mean 64 pages is optimal, but it's at least not arbitrary,
it's just an artifact of how KVM processes the bitmap.

To be clear, I'm not opposed to per-arch behavior, nor do I think x86 should
dictate how all other architectures should behave.  But I would strongly prefer
to avoid per-arch behavior unless it's actually necessary (doubly so for batching).

In other words, if we do need per-arch behavior, e.g. because aggressively dropping
mmu_lock causes performance issues on other architectures that need to take mmu_lock
for write to handle faults, I would prefer to have the arch knob control whether
the lock+unlock is outside versus inside the loop, not control an arbitrary batch
size.
