Re: [PATCH v2] KVM: Aggressively drop and reacquire mmu_lock during CLEAR_DIRTY_LOG

On Thu, Apr 18, 2024, David Matlack wrote:
> On Mon, Apr 15, 2024 at 1:00 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > On a related topic, I think we should take a hard look at the rwlock_needbreak()
> > usage in tdp_mmu_iter_cond_resched().  Because dropping when allocating is really
> > just speculatively dropping mmu_lock because it _might_ be contended, but doing
> > so at a batch size that provides a good balance between doing enough work under
> > mmu_lock and providing low latency for vCPUs.  I.e. in theory, we should be able
> > to handle this fully in tdp_mmu_iter_cond_resched(), but that code is nowhere
> > near smart enough and it's currently only for preemptible kernels (or at least,
> > it's supposed to be only for preemptible kernels).
> 
> Dropping the lock for allocating is also to drop GFP_NOWAIT, i.e. to
> allow direct reclaim and other blocking operations. This is valuable
> for "cry-for-help" type migrations where the host is under intense
> memory pressure. I'd rather do the reclaim on the eager page splitting
> thread than a vCPU.

...

> >  (a) We drop the preemption dependency from tdp_mmu_iter_cond_resched()'s lock
> >      contention logic, and improve the logic (especially the forward progress
> >      guarantees) so that tdp_mmu_iter_cond_resched() provides solid performance
> >      in all cases.
> 
> The only way I can think of to universally measure forward progress
> would be by wall time. Again that becomes more possible with the
> mmu_lock timing stats. But we'll have to hand-pick some thresholds and
> that feels wrong...

Yeah, hard pass on anything based on arbitrary time thresholds.  And I think we
should have the mmu_lock stats be buried behind a Kconfig, i.e. we shouldn't use
them in KVM to guide behavior in any way.
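
E.g. something like the below, where CONFIG_KVM_MMU_LOCK_STATS, the wrapper, and
the mmu_lock_wait_ns counter are all hypothetical stand-ins for whatever a stats
series would actually add; the point is simply that the accounting compiles away
when the Kconfig is off and is never consulted by KVM to make decisions:

#ifdef CONFIG_KVM_MMU_LOCK_STATS
/* Hypothetical: account time spent waiting to acquire mmu_lock for write. */
static void kvm_mmu_lock_write(struct kvm *kvm)
{
        ktime_t start = ktime_get();

        write_lock(&kvm->mmu_lock);

        /* mmu_lock_wait_ns is a hypothetical, stats-only counter. */
        atomic64_add(ktime_to_ns(ktime_sub(ktime_get(), start)),
                     &kvm->arch.mmu_lock_wait_ns);
}
#else
static void kvm_mmu_lock_write(struct kvm *kvm)
{
        write_lock(&kvm->mmu_lock);
}
#endif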

> >  (b) We completely remove the rwlock_needbreak() checks from
> >      tdp_mmu_iter_cond_resched(), and instead rely on unconditionally dropping
> >      mmu_lock in flows where doing so provides the best overall balance, e.g. as
> >      in the eager page split case.
> >
> > I don't have a strong preference between (a) and (b), though I think I'd lean
> > towards (b), because it's simpler.  My guess is that we can achieve similar
> > performance results with both.  E.g. odds are decent that the "best" batch size
> > (see #6) is large enough that the cost of dropping and reacquiring mmu_lock is
> > in the noise when it's not contented.
> >
> > The main argument I see for (b) is that it's simpler, as only code that actually
> > has a justified need to drop mmu_lock does so.  The advantage I see with (a) is
> > that it would provide structure and documentation for choosing when to drop
> > mmu_lock (or not).
> 
> I need to think it through more but I'm leaning toward (b) and use the
> mmu_lock stats to flag potential flows that are holding the lock too
> long. With (b) we can make each flow incrementally better and don't
> have to pick any magic numbers.

I'm fully in the (b) boat given the GFP_NOWAIT => direct reclaim piece of the
puzzle.  And keep in mind that:

  1. TDP MMU now frees and allocates roots with mmu_lock held for read
  2. The "zap everything" MTRR trainwreck likely on its way out the door[1]
  3. The change_pte() mmu_notifier is gone[2]
  4. We're working on aging GFNs with mmu_lock held for read, or not at all[3]

As a result, the only paths that take mmu_lock for write are "zap all",
mmu_notifier invalidation events, and APICv inhibit toggling, which does
kvm_zap_gfn_range() on a single page, i.e. won't ever need to yield.

The "slow" zap all is mutually exclusive with anything interesting because it
only runs when the process is exiting.

The "fast" zap all should never yield when it holds mmu_lock for write, because
the part that runs with mmu_lock held for write is <drum roll> fast.  Maaaybe
the slow part that runs with mmu_lock held for read could drop mmu_lock on
contention.

That just leaves mmu_notifier invalidations, and the mess with dynamic preemption
that resulted in KVM bouncing mmu_lock between invalidations and vCPUs is pretty
strong evidence that yielding on mmu_lock contention when zapping SPTEs is a
terrible idea.

And yielding from most flows that hold mmu_lock for read is also a net negative,
as (a) *all* readers need to go away, (b) the majority of such flows are
short-lived and operate in vCPU context, and (c) the remaining flows that take
mmu_lock are either slow paths (zap all, mmu_notifier invalidations), or rare
(APICv toggling).
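
FWIW, on the code side, (b) more or less boils down to deleting the
rwlock_needbreak() check from tdp_mmu_iter_cond_resched() so that it yields only
when the scheduler actually wants the CPU back.  Roughly (a simplified sketch,
not a literal diff of the upstream helper):

static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
                                                          struct tdp_iter *iter,
                                                          bool flush, bool shared)
{
        WARN_ON_ONCE(iter->yielded);

        /* Ensure forward progress has been made before yielding. */
        if (iter->next_last_level_gfn == iter->yielded_gfn)
                return false;

        /* Option (b): no rwlock_needbreak(), yield only on need_resched(). */
        if (need_resched()) {
                if (flush)
                        kvm_flush_remote_tlbs(kvm);

                rcu_read_unlock();

                /* Drops and reacquires mmu_lock as needed to reschedule. */
                if (shared)
                        cond_resched_rwlock_read(&kvm->mmu_lock);
                else
                        cond_resched_rwlock_write(&kvm->mmu_lock);

                rcu_read_lock();

                WARN_ON_ONCE(iter->gfn > iter->next_last_level_gfn);

                iter->yielded = true;
        }

        return iter->yielded;
}

The details may not pan out exactly like that; the sketch is only meant to show
the shape of (b).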

In short, I highly doubt we'll ever have more than a few flows where yielding
because mmu_lock is contended is actually desirable.  As a result, odds are good
that the reasons for dropping mmu_lock in the middle of a sequence are going to
be unique each and every time.

E.g. eager page splitting wants to drop mmu_lock in order to do direct reclaim,
because it can be super long running, *and* because it doesn't need to flush TLBs
when yielding.
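
To illustrate (purely a sketch; alloc_sp_for_split() is a made-up helper, and the
real code allocates via KVM's caches and also needs the page table page itself),
the pattern is: try GFP_NOWAIT with mmu_lock held, and only fall back to an
allocation that can do direct reclaim after dropping the lock:

static struct kvm_mmu_page *alloc_sp_for_split(struct kvm *kvm, bool shared)
{
        struct kvm_mmu_page *sp;

        /* Fast path: no sleeping, no reclaim, mmu_lock stays held. */
        sp = kzalloc(sizeof(*sp), GFP_NOWAIT | __GFP_ACCOUNT);
        if (sp)
                return sp;

        /*
         * Slow path: drop mmu_lock so that direct reclaim blocks this
         * thread instead of stalling vCPUs.  The caller must assume the
         * lock was dropped and restart its walk.
         */
        if (shared)
                read_unlock(&kvm->mmu_lock);
        else
                write_unlock(&kvm->mmu_lock);

        sp = kzalloc(sizeof(*sp), GFP_KERNEL_ACCOUNT);

        if (shared)
                read_lock(&kvm->mmu_lock);
        else
                write_lock(&kvm->mmu_lock);

        return sp;
}

The key point is that the GFP_KERNEL_ACCOUNT allocation, and thus any direct
reclaim, happens with mmu_lock dropped, on the splitting thread.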

kvm_tdp_mmu_zap_invalidated_roots() is the only flow I can think of where dropping
mmu_lock on contention might make sense, e.g. to allow an mmu_notifier invalidation
to go through.  Again, that's a very long-running flow that doesn't need to flush
TLBs, is (hopefully) not being done from vCPU context, and could put KVM into a
scenario where it blocks mmu_notifier invalidations, which in turn block vCPUs
that are trying to rebuild SPTEs.
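
If we ever did want that, the yield point itself could be dead simple, e.g.
something like this (sketch only; deciding where in the walk it's actually safe
to yield is the real work):

/* Yield only if a writer, e.g. an mmu_notifier invalidation, is waiting. */
if (rwlock_needbreak(&kvm->mmu_lock)) {
        rcu_read_unlock();
        cond_resched_rwlock_read(&kvm->mmu_lock);
        rcu_read_lock();
        /* As above, no TLB flush is needed when yielding from this flow. */
}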

[1] https://lore.kernel.org/all/20240309010929.1403984-1-seanjc@xxxxxxxxxx
[2] https://lore.kernel.org/all/20240405115815.3226315-1-pbonzini@xxxxxxxxxx
[3] https://lore.kernel.org/all/20240401232946.1837665-1-jthoughton@xxxxxxxxxx