Re: [PATCH v8 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn

James Houghton <jthoughton@xxxxxxxxxx> · Mon, 27 Jan 2025 11:52:28 -0800

On Fri, Jan 10, 2025 at 2:47 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Tue, Nov 05, 2024, James Houghton wrote:
> > diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> > index a24fca3f9e7f..f26d0b60d2dd 100644
> > --- a/arch/x86/kvm/mmu/tdp_iter.h
> > +++ b/arch/x86/kvm/mmu/tdp_iter.h
> > @@ -39,10 +39,11 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
> >  }
> >
> >  /*
> > - * SPTEs must be modified atomically if they are shadow-present, leaf
> > - * SPTEs, and have volatile bits, i.e. has bits that can be set outside
> > - * of mmu_lock.  The Writable bit can be set by KVM's fast page fault
> > - * handler, and Accessed and Dirty bits can be set by the CPU.
> > + * SPTEs must be modified atomically if they have bits that can be set outside
> > + * of the mmu_lock. This can happen for any shadow-present leaf SPTEs, as the
> > + * Writable bit can be set by KVM's fast page fault handler, the Accessed and
> > + * Dirty bits can be set by the CPU, and the Accessed and W/R/X bits can be
> > + * cleared by age_gfn_range().
> >   *
> >   * Note, non-leaf SPTEs do have Accessed bits and those bits are
> >   * technically volatile, but KVM doesn't consume the Accessed bit of
> > @@ -53,8 +54,7 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
> >  static inline bool kvm_tdp_mmu_spte_need_atomic_write(u64 old_spte, int level)
> >  {
> >       return is_shadow_present_pte(old_spte) &&
> > -            is_last_spte(old_spte, level) &&
> > -            spte_has_volatile_bits(old_spte);
> > +            is_last_spte(old_spte, level);
>
> I don't like this change on multiple fronts.  First and foremost, it results in
> spte_has_volatile_bits() being wrong for the TDP MMU.  Second, the same logic
> applies to the shadow MMU; the rmap lock prevents a use-after-free of the page
> that owns the SPTE, but the zapping of the SPTE happens before the writer grabs
> the rmap lock.

Thanks Sean. Yes, I forgot about the shadow MMU case.

> Lastly, I'm very, very tempted to say we should omit Accessed state from
> spte_has_volatile_bits() and rename it to something like spte_needs_atomic_write().
> KVM x86 no longer flushes TLBs on aging, so we're already committed to incorrectly
> thinking a page is old in rare cases, for the sake of performance.  The odds of
> KVM clobbering the Accessed bit are probably smaller than the odds of missing an
> Accessed update due to a stale TLB entry.
>
> Note, only the shadow_accessed_mask check can be removed.  KVM needs to ensure
> access-tracked SPTEs are zapped properly, and dirty logging can't have false
> negatives.

I've dropped the change to kvm_tdp_mmu_spte_need_atomic_write() and
instead applied this diff:

--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -142,8 +142,14 @@ bool spte_has_volatile_bits(u64 spte)
                return true;

        if (spte_ad_enabled(spte)) {
-               if (!(spte & shadow_accessed_mask) ||
-                   (is_writable_pte(spte) && !(spte & shadow_dirty_mask)))
+               /*
+                * Do not check the Accessed bit. It can be set (by the CPU)
+                * and cleared (by kvm_tdp_mmu_age_spte()) without holding
+                * the mmu_lock, but when clearing the Accessed bit, we do
+                * not invalidate the TLB, so we can already miss Accessed bit
+                * updates.
+                */
+               if (is_writable_pte(spte) && !(spte & shadow_dirty_mask))
                        return true;
        }

I've also included a new patch to rename spte_has_volatile_bits() to
spte_needs_atomic_write() like you suggested. I merely renamed it in
all locations, including documentation; I haven't reworded the
documentation's use of the word "volatile."
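
For anyone following along, here is a minimal userspace sketch of what
the A/D-enabled branch of the check looks like before and after the diff
above. It is only an illustration: the DEMO_* bit positions and the
demo_* function names are hypothetical (KVM's real masks such as
shadow_accessed_mask are runtime values), and the other checks in the
real function are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical bit positions, chosen only for this illustration. They are
 * NOT KVM's real masks, which are set up at runtime.
 */
#define DEMO_WRITABLE	(1ull << 1)
#define DEMO_ACCESSED	(1ull << 8)
#define DEMO_DIRTY	(1ull << 9)

/* Before: a clear Accessed bit alone was enough to require an atomic write. */
static bool demo_ad_check_old(uint64_t spte)
{
	return !(spte & DEMO_ACCESSED) ||
	       ((spte & DEMO_WRITABLE) && !(spte & DEMO_DIRTY));
}

/* After: only a writable SPTE whose Dirty bit is still clear qualifies. */
static bool demo_ad_check_new(uint64_t spte)
{
	return (spte & DEMO_WRITABLE) && !(spte & DEMO_DIRTY);
}
```

The only inputs on which the two versions disagree are SPTEs with the
Accessed bit clear that are either not writable or already dirty, which
is exactly the case where we now accept possibly losing an Accessed
update.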

>
> >  }
> >
> >  static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 4508d868f1cd..f5b4f1060fff 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -178,6 +178,15 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> >                    ((_only_valid) && (_root)->role.invalid))) {               \
> >               } else
> >
> > +/*
> > + * Iterate over all TDP MMU roots in an RCU read-side critical section.
>
> Heh, that's pretty darn obvious.  It would be far more helpful if the comment
> explained the usage rules, e.g. what is safe (at a high level).

How's this?

+/*
+ * Iterate over all TDP MMU roots in an RCU read-side critical section.
+ * It is safe to iterate over the SPTEs under the root, but their values will
+ * be unstable, so all writes must be atomic. As this routine is meant to be
+ * used without holding the mmu_lock at all, any bits that are flipped must
+ * be reflected in kvm_tdp_mmu_spte_need_atomic_write().
+ */

> > + */
> > +#define for_each_valid_tdp_mmu_root_rcu(_kvm, _root, _as_id)                 \
> > +     list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link)         \
> > +             if ((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) ||     \
> > +                 (_root)->role.invalid) {                                    \
> > +             } else
> > +
> >  #define for_each_tdp_mmu_root(_kvm, _root, _as_id)                   \
> >       __for_each_tdp_mmu_root(_kvm, _root, _as_id, false)
> >
> > @@ -1168,16 +1177,16 @@ static void kvm_tdp_mmu_age_spte(struct tdp_iter *iter)
> >       u64 new_spte;
> >
> >       if (spte_ad_enabled(iter->old_spte)) {
> > -             iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
> > -                                                      iter->old_spte,
> > -                                                      shadow_accessed_mask,
> > -                                                      iter->level);
> > +             iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
> > +                                             shadow_accessed_mask);
>
> Align, and let this poke past 80:
>
>                 iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
>                                                                 shadow_accessed_mask);

Done.

> >               new_spte = iter->old_spte & ~shadow_accessed_mask;
> >       } else {
> >               new_spte = mark_spte_for_access_track(iter->old_spte);
> > -             iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
> > -                                                     iter->old_spte, new_spte,
> > -                                                     iter->level);
> > +             /*
> > +              * It is safe for the following cmpxchg to fail. Leave the
> > +              * Accessed bit set, as the spte is most likely young anyway.
> > +              */
> > +             (void)__tdp_mmu_set_spte_atomic(iter, new_spte);
>
> Just a reminder that this needs to be:
>
>                 if (__tdp_mmu_set_spte_atomic(iter, new_spte))
>                         return;
>

Already applied, thanks!
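
To spell out the two aging paths for readers of the thread, here is a
rough userspace model using C11 atomics. This is a sketch under
assumptions, not the kernel code: the demo_* helpers and DEMO_ACCESSED
are hypothetical stand-ins, clearing DEMO_ACCESSED stands in for
mark_spte_for_access_track(), and the "nonzero means the exchange
failed" convention just mirrors the if (...) return; pattern quoted
above.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit position, for illustration only. */
#define DEMO_ACCESSED	(1ull << 8)

/*
 * A/D-enabled style: clear the Accessed bit with a single atomic
 * read-modify-write. Returns the value observed before the clear.
 */
static uint64_t demo_clear_accessed(_Atomic uint64_t *sptep)
{
	return atomic_fetch_and(sptep, ~DEMO_ACCESSED);
}

/*
 * Compare-and-exchange style: install new_spte only if the SPTE still
 * equals the snapshot. Returns 0 on success, nonzero if the SPTE changed.
 */
static int demo_set_atomic(_Atomic uint64_t *sptep, uint64_t snapshot,
			   uint64_t new_spte)
{
	return atomic_compare_exchange_strong(sptep, &snapshot, new_spte) ? 0 : -1;
}

/*
 * Age an SPTE from a possibly stale snapshot. If the exchange fails, bail
 * out and leave the SPTE untouched; it is most likely young anyway.
 * Returns true only if the SPTE was actually updated.
 */
static bool demo_age(_Atomic uint64_t *sptep, uint64_t snapshot)
{
	uint64_t new_spte = snapshot & ~DEMO_ACCESSED;

	if (demo_set_atomic(sptep, snapshot, new_spte))
		return false;

	return true;
}
```

The point of the early return is that nothing after a failed exchange
(e.g. a tracepoint) should report a change that never happened.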


> >       }
> >
> >       trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level,
> > --
> > 2.47.0.199.ga7371fff76-goog
> >