Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

Sean Christopherson <seanjc@xxxxxxxxxx> · Tue, 11 Jun 2024 12:42:35 -0700

On Tue, Jun 11, 2024, James Houghton wrote:
> On Mon, Jun 10, 2024 at 10:34 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> >
> > On Mon, Jun 10, 2024 at 6:22 PM James Houghton <jthoughton@xxxxxxxxxx> wrote:
> > >
> > > This new notifier is for multi-gen LRU specifically
> >
> > Let me call it out before others do: we can't be this self-serving.
> >
> > > as it wants to be
> > > able to get and clear age information from secondary MMUs only if it can
> > > be done "fast".
> > >
> > > By having this notifier specifically created for MGLRU, what "fast"
> > > means comes down to what is "fast" enough to improve MGLRU's ability to
> > > reclaim most of the time.
> > >
> > > Signed-off-by: James Houghton <jthoughton@xxxxxxxxxx>
> >
> > If we'd like this to pass other MM reviewers, especially the MMU
> > notifier maintainers, we'd need to design a generic API that can
> > benefit all the *existing* users: idle page tracking [1], DAMON [2]
> > and MGLRU.
> >
> > Also I personally prefer to extend the existing callbacks by adding
> > new parameters, and on top of that, I'd try to consolidate the
> > existing callbacks -- it'd be less of a hard sell if my changes result
> > in less code, not more.
> >
> > (v2 did all these, btw.)
> 
> I think consolidating the callbacks is cleanest, like you had it in
> v2. I really wasn't sure about this change honestly, but it was my
> attempt to incorporate feedback like this[3] from v4. I'll consolidate
> the callbacks like you had in v2.

James, wait for others to chime in before committing yourself to a course of
action, otherwise you're going to get ping-ponged to hell and back.

> Instead of the bitmap like you had, I imagine we'll have some kind of
> flags argument that has bits like MMU_NOTIFIER_YOUNG_CLEAR,
> MMU_NOTIFIER_YOUNG_FAST_ONLY, and other ones as they come up. Does
> that sound ok?

Why do we need a bundle of flags?  If we extend .clear_young() and .test_young()
as Yu suggests, then we only need a single "bool fast_only".

As for adding a fast_only versus dedicated APIs, I don't have a strong preference.
Extending will require a small amount of additional churn, e.g. to pass in false,
but that doesn't seem problematic on its own.  On the plus side, there would be
less copy+paste in include/linux/mmu_notifier.h (though that could be solved with
macros :-) ).

E.g. 

--

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 7b77ad6cf833..07872ae00fa6 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -384,7 +384,8 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 
 int __mmu_notifier_clear_young(struct mm_struct *mm,
                               unsigned long start,
-                              unsigned long end)
+                              unsigned long end,
+                              bool fast_only)
 {
        struct mmu_notifier *subscription;
        int young = 0, id;
@@ -393,9 +394,12 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
        hlist_for_each_entry_rcu(subscription,
                                 &mm->notifier_subscriptions->list, hlist,
                                 srcu_read_lock_held(&srcu)) {
-               if (subscription->ops->clear_young)
-                       young |= subscription->ops->clear_young(subscription,
-                                                               mm, start, end);
+               if (!subscription->ops->clear_young ||
+                   fast_only && !subscription->ops->has_fast_aging)
+                       continue;
+
+               young |= subscription->ops->clear_young(subscription,
+                                                       mm, start, end);
        }
        srcu_read_unlock(&srcu, id);
 
@@ -403,7 +407,8 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
 }
 
 int __mmu_notifier_test_young(struct mm_struct *mm,
-                             unsigned long address)
+                             unsigned long address,
+                             bool fast_only)
 {
        struct mmu_notifier *subscription;
        int young = 0, id;
@@ -412,12 +417,15 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
        hlist_for_each_entry_rcu(subscription,
                                 &mm->notifier_subscriptions->list, hlist,
                                 srcu_read_lock_held(&srcu)) {
-               if (subscription->ops->test_young) {
-                       young = subscription->ops->test_young(subscription, mm,
-                                                             address);
-                       if (young)
-                               break;
-               }
+               if (!subscription->ops->test_young)
+                       continue;
+
+               if (fast_only && !subscription->ops->has_fast_aging)
+                       continue;
+
+               young = subscription->ops->test_young(subscription, mm, address);
+               if (young)
+                       break;
        }
        srcu_read_unlock(&srcu, id);
-- 

It might also require multiplexing the return value to differentiate between
"young" and "failed".  Ugh, but the code already does that, just in a bespoke way.

Double ugh.  Peeking ahead at the "failure" code, NAK to adding
kvm_arch_young_notifier_likely_fast for all the same reasons I objected to
kvm_arch_has_test_clear_young() in v1.  Please stop trying to do anything like
that, I will NAK each every attempt to have core mm/ code call directly into KVM.

Anyways, back to this code, before we spin another version, we need to agree on
exactly what behavior we want out of secondary MMUs.  Because to me, the behavior
proposed in this version doesn't make any sense.

Signalling failure because KVM _might_ have relevant aging information in SPTEs
that require taking kvm->mmu_lock is a terrible tradeoff.  And for the test_young
case, it's flat out wrong, e.g. if a page is marked Accessed in the TDP MMU, then
KVM should return "young", not "failed".

If KVM is using the TDP MMU, i.e. has_fast_aging=true, then there will be rmaps
if and only if L1 ran a nested VM at some point.  But as proposed, KVM doesn't
actually check if there are any shadow TDP entries to process.  That could be
fixed by looking at kvm->arch.indirect_shadow_pages, but even then it's not clear
that bailing if kvm->arch.indirect_shadow_pages > 0 makes sense.

E.g. if L1 happens to be running an L2, but <10% of the VM's memory is exposed to
L2, then "failure" is pretty much guaranteed to a false positive.  And even for
the pages that are exposed to L2, "failure" will occur if and only if the pages
are being accessed _only_ by L2.

There most definitely are use cases where the majority of a VM's memory is accessed
only by L2.  But if those use cases are performing poorly under MGLRU, then IMO
we should figure out a way to enhance KVM to do a fast harvest of nested TDP
Accessed information, not make MGRLU+KVM suck for a VMs that run nested VMs.

Oh, and calling into mmu_notifiers to do the "slow" version if the fast version
fails is suboptimal.

So rather than failing the fast aging, I think what we want is to know if an
mmu_notifier found a young SPTE during a fast lookup.  E.g. something like this
in KVM, where using kvm_has_shadow_mmu_sptes() instead of kvm_memslots_have_rmaps()
is an optional optimization to avoid taking mmu_lock for write in paths where a
(very rare) false negative is acceptable.

  static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm)
  {
	return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
  }

  static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
			 bool fast_only)
  {
	int young = 0;

	if (!fast_only && kvm_has_shadow_mmu_sptes(kvm)) {
		write_lock(&kvm->mmu_lock);
		young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
		write_unlock(&kvm->mmu_lock);
	}

	if (tdp_mmu_enabled && kvm_tdp_mmu_age_gfn_range(kvm, range))
		young = 1 | MMU_NOTIFY_WAS_FAST;

	return (int)young;
  }

and then in lru_gen_look_around():

	if (spin_is_contended(pvmw->ptl))
		return false;

	/* exclude special VMAs containing anon pages from COW */
	if (vma->vm_flags & VM_SPECIAL)
		return false;

	young = ptep_clear_young_notify(vma, addr, pte);
	if (!young)
		return false;

	if (!(young & MMU_NOTIFY_WAS_FAST))
		return true;

	young = 1;

with the lookaround done using ptep_clear_young_notify_fast().

The MMU_NOTIFY_WAS_FAST flag is gross, but AFAICT it would Just Work without
needing to update all users of ptep_clear_young_notify() and friends.