Re: [Patch] KVM: x86/mmu: Make optimized __handle_changed_spte() for clear dirty log

Vipin Sharma <vipinsh@xxxxxxxxxx> · Sat, 28 Jan 2023 09:20:00 -0800

On Wed, Jan 25, 2023 at 5:49 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Wed, Jan 25, 2023, Ben Gardon wrote:
> > On Wed, Jan 25, 2023 at 2:00 PM David Matlack <dmatlack@xxxxxxxxxx> wrote:
> > >
> > > On Wed, Jan 25, 2023 at 01:38:57PM -0800, Vipin Sharma wrote:
> > > > @@ -736,7 +749,12 @@ static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> > > >
> > > >       old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
> > > >
> > > > -     __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
> > > > +     if (record_dirty_log)
> > > > +             __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte,
> > > > +                                   level, false);
> > > > +     else
> > > > +             handle_changed_spte_clear_dirty_log(as_id, gfn, old_spte,
> > > > +                                                 new_spte, level);
> > >
> > > I find it very non-intuitive to tie this behavior to !record_dirty_log,
> > > even though it happens to work. It's also risky if any future calls are
> > > added that pass record_dirty_log=false but change other bits in the
> > > SPTE.
> > >
> > > I wonder if we could get the same effect by optimizing
> > > __handle_changed_spte() to check for a cleared D-bit *first* and if
> > > that's the only diff between the old and new SPTE, bail immediately
> > > rather than doing all the other checks.
> >
> > I would also prefer that course. One big lesson I took when building
> > the TDP MMU was the value of having all the changed PTE handling go
> > through one function. That's not always the right choice but the
> > Shadow MMU has a crazy spaghetti code of different functions which
> > handle different parts of responding to a PTE change and it makes the
> > code very hard to read. I agree this path is worth optimizing, but the
> > more we can keep the code together, the better.
>
> Hrm, not sure I agree on that last point.  I find record_dirty_log and record_acc_track
> to be eye sores and terribly confusing (I always forget when they're true/false).
> I think we would end up with cleaner code overall if we special case clearing A/D
> bits (or their write-protected/access-protect counterparts).
>
> handle_changed_spte_dirty_log() takes effect if and only a SPTE becomes newly
> writable, and that should never happen when aging gfns (record_acc_track=%false),
> i.e. neither the "age gfns" nor the "clear dirty" paths need to call
> handle_changed_spte_dirty_log(), which means @record_dirty_log is superfluous.
>
> Similarly, clearing dirty bits should never clear the accessed bit, nor should it
> toggle PRESENT or change the PFN, i.e. neither path needs to call
> handle_changed_spte_acc_track(), which means @record_acc_track is superfluous too.
>
> If we jettison those, then AFAICT the only remaining heuristic is that
> tdp_mmu_set_spte_atomic() doesn't update the dirty bitmaps (the comment about
> that behavior is unhelpful and doesn't explain _why_).  That heuristic is easy to
> handled by looking at @shared.

Looking at the git history and current usage,
tdp_mmu_set_spte_atomic() is used for
1. clearing_dirty_gfn_range() for enabling dirty log in PML
2. wrprot_gfn_range() for enabling dirty logging for wrprot and PML
3. zapping pages
4. splitting hugepages

None of the above will make a page dirty.

>
> Looking through all the other stuff in __handle_changed_spte()...
>
> I'm 100% comfortable skipping these sanity checks:
>
>         if (was_leaf && is_leaf && pfn_changed)
>                 BUG();
>
>         if (is_leaf)
>                 check_spte_writable_invariants(new_spte);
>
>         if (!was_present && !is_present)
>                 WARN_ON(!MMIO && !REMOVED);
>
> And none of these are relevant (again assuming we don't have an egregious bug)
> except for the kvm_set_pfn_dirty() case, which is trivial to handle.
>
>         if (is_leaf != was_leaf)
>                 kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
>
>         if (was_leaf && is_dirty_spte(old_spte) &&
>             (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
>                 kvm_set_pfn_dirty(spte_to_pfn(old_spte));
>
>         /*
>          * Recursively handle child PTs if the change removed a subtree from
>          * the paging structure.  Note the WARN on the PFN changing without the
>          * SPTE being converted to a hugepage (leaf) or being zapped.  Shadow
>          * pages are kernel allocations and should never be migrated.
>          */
>         if (was_present && !was_leaf &&
>             (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
>                 handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
>
>
> So unless I'm missing something, I think we can end up with the below.  Compile
> tested only...
>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 92 ++++++++++----------------------------
>  1 file changed, 24 insertions(+), 68 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index bba33aea0fb0..2f78ca43a276 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -505,7 +505,7 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
>  }
>
>  /**
> - * __handle_changed_spte - handle bookkeeping associated with an SPTE change
> + * handle_changed_spte - handle bookkeeping associated with an SPTE change
>   * @kvm: kvm instance
>   * @as_id: the address space of the paging structure the SPTE was a part of
>   * @gfn: the base GFN that was mapped by the SPTE
> @@ -519,9 +519,9 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
>   * Handle bookkeeping that might result from the modification of a SPTE.
>   * This function must be called for all TDP SPTE modifications.
>   */
> -static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> -                                 u64 old_spte, u64 new_spte, int level,
> -                                 bool shared)
> +static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> +                               u64 old_spte, u64 new_spte, int level,
> +                               bool shared)
>  {
>         bool was_present = is_shadow_present_pte(old_spte);
>         bool is_present = is_shadow_present_pte(new_spte);
> @@ -605,23 +605,18 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>         if (was_present && !was_leaf &&
>             (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
>                 handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
> -}
>
> -static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> -                               u64 old_spte, u64 new_spte, int level,
> -                               bool shared)
> -{
> -       __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
> -                             shared);
>         handle_changed_spte_acc_track(old_spte, new_spte, level);
> -       handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
> -                                     new_spte, level);
> +
> +       /* COMMENT GOES HERE. */

Current "shared" callers are not making a page dirty. If a new
"shared" caller makes a page dirty then make sure
handle_changed_spte_dirty_log is called.

How is this?

> +       if (!shared)
> +               handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
> +                                             new_spte, level);
>  }
>
>  /*
>   * tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically
> - * and handle the associated bookkeeping.  Do not mark the page dirty
> - * in KVM's dirty bitmaps.
> + * and handle the associated bookkeeping.
>   *
>   * If setting the SPTE fails because it has changed, iter->old_spte will be
>   * refreshed to the current value of the spte.
> @@ -658,9 +653,8 @@ static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
>         if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
>                 return -EBUSY;
>
> -       __handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> -                             new_spte, iter->level, true);
> -       handle_changed_spte_acc_track(iter->old_spte, new_spte, iter->level);
> +       handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> +                           new_spte, iter->level, true);
>
>         return 0;
>  }
> @@ -705,23 +699,12 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
>   * @new_spte:        The new value that will be set for the SPTE
>   * @gfn:             The base GFN that was (or will be) mapped by the SPTE
>   * @level:           The level _containing_ the SPTE (its parent PT's level)
> - * @record_acc_track: Notify the MM subsystem of changes to the accessed state
> - *                   of the page. Should be set unless handling an MMU
> - *                   notifier for access tracking. Leaving record_acc_track
> - *                   unset in that case prevents page accesses from being
> - *                   double counted.
> - * @record_dirty_log: Record the page as dirty in the dirty bitmap if
> - *                   appropriate for the change being made. Should be set
> - *                   unless performing certain dirty logging operations.
> - *                   Leaving record_dirty_log unset in that case prevents page
> - *                   writes from being double counted.
>   *
>   * Returns the old SPTE value, which _may_ be different than @old_spte if the
>   * SPTE had voldatile bits.
>   */
>  static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> -                             u64 old_spte, u64 new_spte, gfn_t gfn, int level,
> -                             bool record_acc_track, bool record_dirty_log)
> +                             u64 old_spte, u64 new_spte, gfn_t gfn, int level)
>  {
>         lockdep_assert_held_write(&kvm->mmu_lock);
>
> @@ -736,46 +719,18 @@ static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
>
>         old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
>
> -       __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
> -
> -       if (record_acc_track)
> -               handle_changed_spte_acc_track(old_spte, new_spte, level);
> -       if (record_dirty_log)
> -               handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
> -                                             new_spte, level);
> +       handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
>         return old_spte;
>  }
>
> -static inline void _tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> -                                    u64 new_spte, bool record_acc_track,
> -                                    bool record_dirty_log)
> +static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> +                                    u64 new_spte)
>  {
>         WARN_ON_ONCE(iter->yielded);
>
>         iter->old_spte = __tdp_mmu_set_spte(kvm, iter->as_id, iter->sptep,
>                                             iter->old_spte, new_spte,
> -                                           iter->gfn, iter->level,
> -                                           record_acc_track, record_dirty_log);
> -}
> -
> -static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> -                                   u64 new_spte)
> -{
> -       _tdp_mmu_set_spte(kvm, iter, new_spte, true, true);
> -}
> -
> -static inline void tdp_mmu_set_spte_no_acc_track(struct kvm *kvm,
> -                                                struct tdp_iter *iter,
> -                                                u64 new_spte)
> -{
> -       _tdp_mmu_set_spte(kvm, iter, new_spte, false, true);
> -}
> -
> -static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
> -                                                struct tdp_iter *iter,
> -                                                u64 new_spte)
> -{
> -       _tdp_mmu_set_spte(kvm, iter, new_spte, true, false);
> +                                           iter->gfn, iter->level);
>  }
>
>  #define tdp_root_for_each_pte(_iter, _root, _start, _end) \
> @@ -925,7 +880,7 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>                 return false;
>
>         __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
> -                          sp->gfn, sp->role.level + 1, true, true);
> +                          sp->gfn, sp->role.level + 1);
>
>         return true;
>  }
> @@ -1289,8 +1244,7 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
>                 new_spte = mark_spte_for_access_track(new_spte);
>         }
>
> -       tdp_mmu_set_spte_no_acc_track(kvm, iter, new_spte);
> -
> +       kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte, new_spte, iter->level);
>         return true;
>  }
>
> @@ -1326,7 +1280,7 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
>          * Note, when changing a read-only SPTE, it's not strictly necessary to
>          * zero the SPTE before setting the new PFN, but doing so preserves the
>          * invariant that the PFN of a present * leaf SPTE can never change.
> -        * See __handle_changed_spte().
> +        * See handle_changed_spte().
>          */
>         tdp_mmu_set_spte(kvm, iter, 0);
>
> @@ -1351,7 +1305,7 @@ bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>         /*
>          * No need to handle the remote TLB flush under RCU protection, the
>          * target SPTE _must_ be a leaf SPTE, i.e. cannot result in freeing a
> -        * shadow page.  See the WARN on pfn_changed in __handle_changed_spte().
> +        * shadow page.  See the WARN on pfn_changed in handle_changed_spte().
>          */
>         return kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn);
>  }
> @@ -1703,9 +1657,11 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
>                                 new_spte = iter.old_spte & ~shadow_dirty_mask;
>                         else
>                                 continue;
> +
> +                       kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
>                 }
>

Shouldn't we handle spte_ad_need_write_protect(iter.old_spte)
separately and if this function returns true then on clearing
PT_WRITABLE_MASK, kvm_set_pfn_dirty be called?
My understanding is that the spte_ad_need_write_protect() will return
true for nested VM sptes when PML mode is enabled.

> -               tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
> +               kvm_tdp_mmu_write_spte(iter.sptep, iter.old_spte, new_spte, iter.level);
>         }
>
>         rcu_read_unlock();
>
> base-commit: f15a87c006901e02727bf8ac75b0251cdf8e0ecc
> --
>