Re: [PATCH v2] KVM: Aggressively drop and reacquire mmu_lock during CLEAR_DIRTY_LOG

On 2024-04-12 09:14 AM, David Matlack wrote:
> On Thu, Apr 4, 2024 at 11:17 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Thu, Apr 04, 2024, David Matlack wrote:
> > > > I don't love the idea of adding more arch specific MMU behavior (going the wrong
> > > > direction), but it doesn't seem like an unreasonable approach in this case.
> > >
> > > I wonder if this is being overly cautious.
> >
> > Probably.  "Lazy" is another word for it ;-)
> >
> > > I would expect only more benefit on architectures that more aggressively take
> > > the mmu_lock on vCPU threads during faults. The more lock acquisition on vCPU
> > > threads, the more this patch will help reduce vCPU starvation during
> > > CLEAR_DIRTY_LOG.
> > >
> > > Hm, perhaps testing with ept=N (which will use the write-lock for even
> > > dirty logging faults) would be a way to increase confidence in the
> > > effect on other architectures?
> >
> > Turning off the TDP MMU would be more representative, just manually disable the
> > fast-path, e.g.
> 
> Good idea. I'm actually throwing in some writable module parameters
> too to make it easy to toggle between configurations.
> 
> I'll report back when I have some data.

tl;dr

 * My patch likely _will_ regress migration performance on other architectures.
   Thank you Bibo and Sean for keeping me honest here.

 * I suspect the original issue my patch is trying to fix is actually specific
   to the way the TDP MMU does eager page splitting and a more targeted fix is
   warranted.

---

To evaluate my patch I tested on x86 with different mmu_lock configurations
to simulate other architectures.

 Config 1: tdp_mmu=Y fast_page_fault_read_lock=N eager_page_split=Y
 Config 2: tdp_mmu=Y fast_page_fault_read_lock=Y eager_page_split=Y
 Config 3: tdp_mmu=N fast_page_fault_read_lock=N eager_page_split=N

Note: "fast_page_fault_read_lock" is a non-upstream parameter I added to
add a read_lock/unlock() in fast_page_fault().
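
Concretely, the hack looks roughly like this (an illustrative sketch; the
parameter and its placement are my local hack, not upstream code):

static bool fast_page_fault_read_lock;
module_param(fast_page_fault_read_lock, bool, 0644);

static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
	...
	if (fast_page_fault_read_lock)
		read_lock(&vcpu->kvm->mmu_lock);

	/* existing lockless SPTE walk and cmpxchg */

	if (fast_page_fault_read_lock)
		read_unlock(&vcpu->kvm->mmu_lock);
	...
}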

Config 1 is vanilla KVM/x86. Config 2 emulates KVM/arm64. Config 3
approximates what LoongArch would look like if it gained support for lockless
write-protection fault handling.

The test I ran was a live migration of a 16-vCPU, 64GB VM running an aggressive
write-heavy workload. To compare runs I evaluated 3 metrics:

 * Duration of pre-copy.
 * Amount of dirty memory going into post-copy.
 * Total CPU usage of CLEAR_DIRTY_LOG.

The following table shows how each metric changed after adding my patch to drop
mmu_lock during CLEAR_DIRTY_LOG.

          | Precopy Duration | Post-Copy Dirty | CLEAR_DIRTY_LOG CPU
 ---------|------------------|-----------------|---------------------
 Config 1 | -1%              | -1%             | +6%
 Config 2 | -1%              | +1%             | +123%
 Config 3 | +32%             | +158%           | +5200%

Configs 2 and 3 both show regressions, with Config 3 severely regressed in all
3 metrics.

Given these regressions, I started rethinking the original issue this patch is
trying to fix.

The dips in guest performance during CLEAR_DIRTY_LOG occurred during the first
pre-copy pass but not during subsequent passes. One thing unique to the first
pass is eager page splitting.

Ah ha, a theory! The TDP MMU allocates shadow pages while holding mmu_lock
during eager page splitting, and only drops the lock if need_resched is true or
a GFP_NOWAIT allocation fails. If neither occurs, CLEAR_DIRTY_LOG can
potentially hold mmu_lock in write mode for a long time.
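
For reference, the flow in tdp_mmu_alloc_sp_for_split() is roughly the
following (paraphrased from memory against the tree the diff below applies to,
with the nid derivation elided; not an exact quote):

static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
						       struct tdp_iter *iter,
						       bool shared)
{
	struct kvm_mmu_page *sp;

	/* First try an atomic allocation, without dropping mmu_lock. */
	sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);
	if (sp)
		return sp;

	/* Only on failure: drop mmu_lock and retry with reclaim allowed. */
	rcu_read_unlock();

	if (shared)
		read_unlock(&kvm->mmu_lock);
	else
		write_unlock(&kvm->mmu_lock);

	iter->yielded = true;
	sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT, nid);

	/* ... reacquire mmu_lock and rcu_read_lock(), return sp ... */
}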

Second, the host platform where we saw the dips has nx_huge_pages=Y. I suspect
the long CLEAR_DIRTY_LOG calls are blocking vCPUs from handling steady-state
NX Huge Page faults, causing the dips in performance.

This theory also explains why we (Google) haven't seen similar drops in guest
performance when using !manual-protect, as in that configuration the TDP MMU
does eager page splitting under the read-lock instead of write-lock.
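
Concretely, the two eager page splitting paths take mmu_lock in different
modes (call sites from memory, so take the exact names with a grain of salt):

	/* CLEAR_DIRTY_LOG (manual-protect): eager split under the write lock. */
	write_lock(&kvm->mmu_lock);
	kvm_tdp_mmu_try_split_huge_pages(kvm, slot, start, end, target_level, false);
	write_unlock(&kvm->mmu_lock);

	/* Enabling dirty logging (!manual-protect): split under the read lock. */
	read_lock(&kvm->mmu_lock);
	kvm_tdp_mmu_try_split_huge_pages(kvm, slot, start, end, target_level, true);
	read_unlock(&kvm->mmu_lock);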

If this is all true, then a better / more targeted fix for this issue would be
to drop mmu_lock in the TDP MMU eager page splitting path. For example, we
could limit the "allocate under lock" behavior to only when the read-lock is
held, e.g.

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7dfdc49a6ade..ea34f8232d97 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1472,9 +1472,11 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
         * If this allocation fails we drop the lock and retry with reclaim
         * allowed.
         */
-       sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);
-       if (sp)
-               return sp;
+       if (shared) {
+               sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);
+               if (sp)
+                       return sp;
+       }

        rcu_read_unlock();
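
i.e. when mmu_lock is held for write, always drop the lock and allocate with
reclaim allowed, so vCPUs contending on mmu_lock can make forward progress in
the meantime.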

I checked the KVM/arm64 eager page splitting code, and it drops the mmu_lock to
allocate page tables. So I suspect no fix is needed there and this is, in fact,
a purely x86-specific issue.
