On Mon, Apr 15, 2024, David Matlack wrote:
> * I suspect the original issue my patch is trying to fix is actually specific
>   to the way the TDP MMU does eager page splitting and a more targeted fix is
>   warranted.
>
> ---
>
> To evaluate my patch I tested on x86 with different mmu_lock configurations
> to simulate other architectures.
>
> Config 1: tdp_mmu=Y fast_page_fault_read_lock=N eager_page_split=Y
> Config 2: tdp_mmu=Y fast_page_fault_read_lock=Y eager_page_split=Y
> Config 3: tdp_mmu=N fast_page_fault_read_lock=N eager_page_split=N
>
> Note: "fast_page_fault_read_lock" is a non-upstream parameter I added to
> add a read_lock/unlock() in fast_page_fault().
>
> Config 1 is vanilla KVM/x86. Config 2 emulates KVM/arm64. Config 3 emulates
> LoongArch if LoongArch added support for lockless write-protection fault
> handling.
>
> The test I ran was a Live Migration of a 16VCPU 64GB VM running an aggressive
> write-heavy workload. To compare runs I evaluated 3 metrics:
>
>  * Duration of pre-copy.
>  * Amount of dirty memory going into post-copy.
>  * Total CPU usage of CLEAR_DIRTY_LOG.
>
> The following table shows how each metric changed after adding my patch to drop
> mmu_lock during CLEAR_DIRTY_LOG.
>
>          | Precopy Duration | Post-Copy Dirty | CLEAR_DIRTY_LOG CPU
> ---------|------------------|-----------------|---------------------
> Config 1 | -1%              | -1%             | +6%
> Config 2 | -1%              | +1%             | +123%
> Config 3 | +32%             | +158%           | +5200%
>
> Config 2 and 3 both show regressions, with Config 3 severely regressed in all 3
> dimensions.

...

> If this is all true, then a better / more targeted fix for this issue would be
> to drop mmu_lock in the TDP MMU eager page splitting path. For example, we
> could limit the "allocate under lock" behavior to only when the read-lock is
> held, e.g.
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7dfdc49a6ade..ea34f8232d97 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1472,9 +1472,11 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
>          * If this allocation fails we drop the lock and retry with reclaim
>          * allowed.
>          */
> -       sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);
> -       if (sp)
> -               return sp;
> +       if (shared) {
> +               sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);
> +               if (sp)
> +                       return sp;
> +       }
>
>         rcu_read_unlock();
>
> I checked the KVM/arm64 eager page splitting code, and it drops the mmu_lock to
> allocate page tables. So I suspect no fix is needed there and this is, in fact,
> purely an x86-specific issue.

Hmm, it'd be nice if we didn't have to rely on random arch flows to
coincidentally do the optimal thing for eager page splitting. Not sure how
best to document the "best known practice" though.

As for the TDP MMU code, unless the cost of dropping and reacquiring mmu_lock
for read is measurable, I would prefer to unconditionally drop mmu_lock and
delete the GFP_NOWAIT allocation. There can be lock contention when mmu_lock
is held for read, it's just less common.
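E.g. something along the lines of the below (completely untested sketch; the
function signature and the exact lock juggling around the reclaimable
allocation are paraphrased from the quoted hunk, so treat the details as
illustrative only):

static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
						       bool shared, int nid)
{
	struct kvm_mmu_page *sp;

	/*
	 * Unconditionally drop mmu_lock (and the RCU read lock) and do a
	 * fully reclaimable allocation, instead of first attempting a
	 * GFP_NOWAIT allocation while holding the lock.  The caller must
	 * treat this as having yielded, i.e. restart its iteration.
	 */
	rcu_read_unlock();

	if (shared)
		read_unlock(&kvm->mmu_lock);
	else
		write_unlock(&kvm->mmu_lock);

	sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT, nid);

	if (shared)
		read_lock(&kvm->mmu_lock);
	else
		write_lock(&kvm->mmu_lock);

	rcu_read_lock();

	return sp;
}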
On a related topic, I think we should take a hard look at the
rwlock_needbreak() usage in tdp_mmu_iter_cond_resched(). Dropping mmu_lock
when allocating is really just speculatively dropping it because it _might_
be contended, but doing so at a batch size that provides a good balance
between doing enough work under mmu_lock and providing low latency for
vCPUs. I.e. in theory, we should be able to handle this fully in
tdp_mmu_iter_cond_resched(), but that code is nowhere near smart enough, and
it's currently active only for preemptible kernels (or at least, it's
supposed to be only for preemptible kernels).

Simply yielding on contention is not at all optimal, as evidenced by the
whole dynamic preemption debacle[1][2]. The immediate issue was "fixed" by
having vCPUs avoid taking mmu_lock, but KVM really shouldn't get into a
situation where KVM is pathologically dropping mmu_lock to the point where a
non-vCPU action grinds to a halt.

The contention logic fails to take into account many things:

 (1) Is the other task higher priority?
 (2) Is the other task a vCPU, or something else?
 (3) Will yielding actually allow the other task to make forward progress?
 (4) What is the cost of dropping mmu_lock, e.g. is a remote TLB flush needed?
 (5) How long is this task expected to hold mmu_lock?
 (6) Has this task made enough progress for yielding to be a decent choice?

and probably many more than that.

As we discussed off-list, I think there are two viable approaches:

 (a) We drop the preemption dependency from tdp_mmu_iter_cond_resched()'s
     lock contention logic, and improve the logic (especially the forward
     progress guarantees) so that tdp_mmu_iter_cond_resched() provides solid
     performance in all cases.

 (b) We completely remove the rwlock_needbreak() checks from
     tdp_mmu_iter_cond_resched(), and instead rely on unconditionally
     dropping mmu_lock in flows where doing so provides the best overall
     balance, e.g. as in the eager page split case.

I don't have a strong preference between (a) and (b), though I think I'd
lean towards (b). My guess is that we can achieve similar performance
results with both, e.g. odds are decent that the "best" batch size (see (6))
is large enough that the cost of dropping and reacquiring mmu_lock is in the
noise when it's not contended.

The main argument I see for (b) is that it's simpler: only code that
actually has a justified need to drop mmu_lock does so. The advantage I see
with (a) is that it would provide structure and documentation for choosing
when (or when not) to drop mmu_lock. E.g. dropping in the eager page split
path makes sense because KVM does so at a large batch size, odds are good
that the contending task is a vCPU, no TLB flush is required, the total
mmu_lock hold time is high, and we know that dropping mmu_lock will allow
vCPUs to make forward progress. (a) would do a much better job of capturing
all of that in code, albeit with quite a bit more complexity.

Regardless of which option we choose, I think we should drop the preemptible
kernel dependency from the lock contention logic in
tdp_mmu_iter_cond_resched(), i.e. manually check if mmu_lock is contended
instead of bouncing through rwlock_needbreak(). The current approach
essentially means that there's zero testing of the performance of the
yield-on-contention logic. E.g. the complaints about the TDP MMU yielding
too aggressively only popped up when commit c597bfddc9e9 ("sched: Provide
Kconfig support for default dynamic preempt mode") unintentionally enabled
rwlock_needbreak() by default.

That's definitely easier said than done though, as I suspect that if we
switched to rwlock_is_contended(), i.e. dropped the preemptible requirement,
without also enhancing tdp_mmu_iter_cond_resched() to make it smarter as
above, we'd see a lot of performance regressions.
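For illustration, the "manual" check itself is a one-liner; the body of
tdp_mmu_iter_cond_resched() below is paraphrased from memory and is not a
tested patch, and as noted above the forward progress heuristics would need
to get a lot smarter before actually making this switch:

static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
							   struct tdp_iter *iter,
							   bool flush, bool shared)
{
	/* Ensure forward progress has been made before yielding. */
	if (iter->next_last_level_gfn == iter->yielded_gfn)
		return false;

	/*
	 * Check contention directly instead of bouncing through
	 * rwlock_needbreak(), which only reports contention on preemptible
	 * kernels (modulo the dynamic preemption wrinkle above).
	 */
	if (need_resched() || rwlock_is_contended(&kvm->mmu_lock)) {
		if (flush)
			kvm_flush_remote_tlbs(kvm);

		rcu_read_unlock();

		/*
		 * Explicitly drop and reacquire mmu_lock so that the
		 * contending task can get in, regardless of the kernel's
		 * preemption model.
		 */
		if (shared) {
			read_unlock(&kvm->mmu_lock);
			cond_resched();
			read_lock(&kvm->mmu_lock);
		} else {
			write_unlock(&kvm->mmu_lock);
			cond_resched();
			write_lock(&kvm->mmu_lock);
		}

		rcu_read_lock();

		iter->yielded = true;
	}

	return iter->yielded;
}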
[1] https://lore.kernel.org/all/20240312193911.1796717-3-seanjc@xxxxxxxxxx
[2] https://lore.kernel.org/all/20240222012640.2820927-1-seanjc@xxxxxxxxxx