Re: [PATCH 21/22] KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging gfns

On Tue, Sep 03, 2024, James Houghton wrote:
> On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > DO NOT MERGE, yet...
> >
> > Cc: James Houghton <jthoughton@xxxxxxxxxx>
> > Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 63 +++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 59 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 48e8608c2738..9df6b465de06 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -995,13 +995,11 @@ static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
> >   * locking is the same, but the caller is disallowed from modifying the rmap,
> >   * and so the unlock flow is a nop if the rmap is/was empty.
> >   */
> > -__maybe_unused
> >  static unsigned long kvm_rmap_lock_readonly(struct kvm_rmap_head *rmap_head)
> >  {
> >         return __kvm_rmap_lock(rmap_head);
> >  }
> >
> > -__maybe_unused
> >  static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head,
> >                                      unsigned long old_val)
> >  {
> > @@ -1743,8 +1741,53 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
> >         __rmap_add(vcpu->kvm, cache, slot, spte, gfn, access);
> >  }
> >
> > -static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
> > -                                  struct kvm_gfn_range *range, bool test_only)
> > +static bool kvm_rmap_age_gfn_range_lockless(struct kvm *kvm,
> > +                                           struct kvm_gfn_range *range,
> > +                                           bool test_only)
> > +{
> > +       struct kvm_rmap_head *rmap_head;
> > +       struct rmap_iterator iter;
> > +       unsigned long rmap_val;
> > +       bool young = false;
> > +       u64 *sptep;
> > +       gfn_t gfn;
> > +       int level;
> > +       u64 spte;
> > +
> > +       for (level = PG_LEVEL_4K; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> > +               for (gfn = range->start; gfn < range->end;
> > +                    gfn += KVM_PAGES_PER_HPAGE(level)) {
> > +                       rmap_head = gfn_to_rmap(gfn, level, range->slot);
> > +                       rmap_val = kvm_rmap_lock_readonly(rmap_head);
> > +
> > +                       for_each_rmap_spte_lockless(rmap_head, &iter, sptep, spte) {
> > +                               if (!is_accessed_spte(spte))
> > +                                       continue;
> > +
> > +                               if (test_only) {
> > +                                       kvm_rmap_unlock_readonly(rmap_head, rmap_val);
> > +                                       return true;
> > +                               }
> > +
> > +                               /*
> > +                                * Marking SPTEs for access tracking outside of
> > +                                * mmu_lock is unsupported.  Report the page as
> > +                                * young, but otherwise leave it as-is.
> 
> Just for my own understanding, what's the main reason why it's unsafe

Note, I specifically said "unsupported", not "unsafe" :-D

> to mark PTEs for access tracking outside the mmu_lock?

It probably can be done safely?  The main issue is that marking the SPTE for
access tracking can also clear the Writable bit, and so we'd need to audit all
the flows that consume is_writable_pte().  Hmm, actually, that's less scary than
it first seems, because thanks to kvm_mmu_notifier_clear_young(), KVM already
clears the Writable bit in AD-disabled SPTEs without a TLB flush.  E.g.
mmu_spte_update() specifically looks at MMU-writable, not the Writable bit, when
deciding if a TLB flush is required.
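
To make that concrete, the flush decision is roughly (paraphrasing the logic
in mmu_spte_update(), not the verbatim code):

	/*
	 * Paraphrased from mmu_spte_update(): the TLB flush is keyed off
	 * losing the MMU-writable software bit, so an aging path that
	 * clears only the hardware Writable bit doesn't by itself demand
	 * a flush.
	 */
	if (is_mmu_writable_spte(old_spte) && !is_mmu_writable_spte(new_spte))
		flush = true;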

On a related note, something I missed is that KVM would need to treat leaf SPTEs
as volatile at all times, as your MGLRU series modified
kvm_tdp_mmu_spte_need_atomic_write(), not the common spte_has_volatile_bits().
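
I.e. the common helper would need to become something like this (a pure
sketch; kvm_lockless_aging_enabled is a made-up knob standing in for however
the feature ends up being keyed):

static bool spte_needs_atomic_update(u64 spte)
{
	if (!is_shadow_present_pte(spte))
		return false;

	/*
	 * Sketch: with lockless aging, the Accessed bit, and the Writable
	 * bit for access-tracked SPTEs, can be cleared at any time, i.e.
	 * every shadow-present leaf SPTE must be treated as volatile.
	 */
	if (kvm_lockless_aging_enabled)
		return true;

	return spte_has_volatile_bits(spte);
}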

Actually, on second thought, maybe it isn't necessary for the AD-enabled case.
Effectively clobbering the Accessed bit is completely fine, as aging is tolerant
of false negatives and false positives, so long as they aren't excessive.  And
that's doubly true if KVM x86 follows MM and doesn't force a TLB flush[1].

Oooh, and triply true if KVM stops marking the folio accessed when zapping SPTEs[2].

So yeah, after thinking through all of the moving parts, maybe we should commit
to aging AD-disabled SPTEs out of mmu_lock.  AD-disabled leaf SPTEs do end up being
"special", because KVM needs to ensure it doesn't clobber the Writable bit, i.e.
AD-disabled leaf SPTEs need to be treated as volatile at all times.  But in practice,
forcing an atomic update for all AD-disabled leaf SPTEs probably doesn't actually
change much, because in most cases KVM is probably using an atomic access anyways,
e.g. because KVM is clearing the Writable bit and the Writable bit is already volatile.
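
E.g. the lockless walk could handle AD-disabled SPTEs with something like this
(a sketch; mark_spte_for_access_track() is the existing helper, the function
name is illustrative):

/*
 * Sketch: age an AD-disabled leaf SPTE without holding mmu_lock.  The atomic
 * cmpxchg ensures a racing writer can't lose the Writable bit when the SPTE
 * is marked for access tracking; if it fails, the SPTE changed underneath
 * us, simply move on, as aging tolerates false positives/negatives.
 */
static void kvm_lockless_age_ad_disabled_spte(u64 *sptep, u64 old_spte)
{
	u64 new_spte = mark_spte_for_access_track(old_spte);

	(void)try_cmpxchg64(sptep, &old_spte, new_spte);
}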

FWIW, marking the folio dirty if the SPTE was writable, as is done today in
mmu_spte_age(), is sketchy if mmu_lock isn't held, but probably ok since this is
invoked from an mmu_notifier and presumably the caller holds a reference to the
page/folio.  But that's largely a moot point since I want to yank out that code
anyways[3].

[1] https://lore.kernel.org/all/ZsS_OmxwFzrqDcfY@xxxxxxxxxx
[2] https://lore.kernel.org/all/20240726235234.228822-82-seanjc@xxxxxxxxxx
[3] https://lore.kernel.org/all/20240726235234.228822-8-seanjc@xxxxxxxxxx
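
For reference, the code in question is the AD-disabled branch of today's
mmu_spte_age(), roughly:

	/*
	 * Roughly today's mmu_spte_age(): capture the dirty status before
	 * the Writable bit is cleared, so it isn't lost when the SPTE is
	 * marked for access tracking.
	 */
	if (is_writable_pte(spte))
		kvm_set_pfn_dirty(spte_to_pfn(spte));

	spte = mark_spte_for_access_track(spte);
	mmu_spte_update_no_track(sptep, spte);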

> > +                               if (spte_ad_enabled(spte))
> > +                                       clear_bit((ffs(shadow_accessed_mask) - 1),
> > +                                                 (unsigned long *)sptep);
> 
> I feel like it'd be kinda nice to de-duplicate this clear_bit() piece
> with the one in kvm_rmap_age_gfn_range().

Ya, definitely no argument against adding a helper.
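
E.g. something like (the name is just a strawman):

static void kvm_clear_spte_accessed_bit(u64 *sptep)
{
	/*
	 * clear_bit() is atomic and touches only the Accessed bit, so this
	 * is safe for AD-enabled SPTEs with or without mmu_lock held.
	 */
	clear_bit((ffs(shadow_accessed_mask) - 1), (unsigned long *)sptep);
}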

> > +                               young = true;
> > +                       }
> > +
> > +                       kvm_rmap_unlock_readonly(rmap_head, rmap_val);
> > +               }
> > +       }
> > +       return young;
> > +}
> > +
> > +static bool __kvm_rmap_age_gfn_range(struct kvm *kvm,
> > +                                    struct kvm_gfn_range *range, bool test_only)
> >  {
> >         struct slot_rmap_walk_iterator iterator;
> >         struct rmap_iterator iter;
> > @@ -1783,6 +1826,18 @@ static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
> >         return young;
> >  }
> >
> > +
> > +static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
> > +                                  struct kvm_gfn_range *range, bool test_only)
> > +{
> > +       /* FIXME: This also needs to be guarded with something like range->fast_only. */
> > +       if (kvm_ad_enabled())
> 
> I expect this to be something like `if (kvm_ad_enabled() ||
> range->fast_only)`. With MGLRU, that means the pages will always be the last
> candidates for eviction, though it is still possible for them to be evicted
> (though I think this would basically never happen). I think this is fine.
> 
> I think the only other possible choice is to record/return 'not young'/false
> instead of 'young'/true if the spte is young but !spte_ad_enabled(). That
> doesn't seem to be obviously better, though we *will* get correct age
> information at eviction time, when !range->fast_only, at which point the page
> will not be evicted, and Accessed will be cleared.

As above, I think the simpler solution overall is to support aging AD-disabled
SPTEs out of mmu_lock.  The sequence of getting to that end state will be more
complex, but most of that complexity is going to happen irrespective of this series.
And it would mean KVM MGLRU support has no chance of landing in 6.12, but again
I think that's the reality either way.
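
For concreteness, the interim guard you're describing would look something
like this (fast_only being from your series, not this one):

static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
				   struct kvm_gfn_range *range, bool test_only)
{
	/*
	 * Sketch: walk locklessly when all SPTEs have A/D bits, or when the
	 * caller tolerates a best-effort answer.  If aging AD-disabled SPTEs
	 * out of mmu_lock pans out, the kvm_ad_enabled() check goes away
	 * entirely.
	 */
	if (kvm_ad_enabled() || range->fast_only)
		return kvm_rmap_age_gfn_range_lockless(kvm, range, test_only);

	lockdep_assert_held_write(&kvm->mmu_lock);
	return __kvm_rmap_age_gfn_range(kvm, range, test_only);
}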
