Re: [PATCH 7/7] KVM: x86/mmu: Recheck SPTE points to a PT during huge page recovery

On Thu, Aug 22, 2024 at 9:50 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Mon, Aug 05, 2024, David Matlack wrote:
> > Recheck the iter.old_spte still points to a page table when recovering
> > huge pages. Since mmu_lock is held for read and tdp_iter_step_up()
> > re-reads iter.sptep, it's possible the SPTE was zapped or recovered by
> > another CPU in between stepping down and back up.
> >
> > This avoids a useless cmpxchg (and possibly a remote TLB flush) if
> > another CPU is recovering huge SPTEs in parallel (e.g. the NX huge page
> > recovery worker, or vCPUs taking faults on the huge page region).
> >
> > This also makes it clear that tdp_iter_step_up() re-reads the SPTE and
> > thus can see a different value, which is not immediately obvious when
> > reading the code.
> >
> > Signed-off-by: David Matlack <dmatlack@xxxxxxxxxx>
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 07d5363c9db7..bdc7fd476721 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1619,6 +1619,17 @@ static void recover_huge_pages_range(struct kvm *kvm,
> >               while (max_mapping_level > iter.level)
> >                       tdp_iter_step_up(&iter);
> >
> > +             /*
> > +              * Re-check that iter.old_spte still points to a page table.
> > +              * Since mmu_lock is held for read and tdp_iter_step_up()
> > +              * re-reads iter.sptep, it's possible the SPTE was zapped or
> > +              * recovered by another CPU in between stepping down and
> > +              * stepping back up.
> > +              */
> > +             if (!is_shadow_present_pte(iter.old_spte) ||
> > +                 is_last_spte(iter.old_spte, iter.level))
> > +                     continue;
>
> This is the part of the step-up logic that I do not like.  Even this check doesn't
> guarantee that the SPTE that is being replaced is the same non-leaf SPTE that was
> used to reach the leaf SPTE.  E.g. in an absurdly theoretical situation, the SPTE
> could be zapped and then re-set with another non-leaf SPTE.  Which is fine, but
> only because of several very subtle mechanisms.

I'm not sure why that matters. The only thing that matters is that the
GFN->PFN mapping and permissions cannot change, and that is guaranteed
by holding mmu_lock for read.

At the end of the day, we never actually care about the value of the
SPTE we are replacing. We only care that it's a non-leaf SPTE.
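
To put it concretely (this is just the patch's check again, annotated
with the invariant I'm relying on, not new logic): the replaced SPTE's
value is only consumed as the cmpxchg's expected value, so if it
changes again after the re-read, tdp_mmu_set_spte_atomic() simply
fails and we move on:

        /*
         * The value of the SPTE being replaced only matters as the
         * cmpxchg's expected value; the recheck just filters out SPTEs
         * that are no longer shadow-present, non-leaf page tables.
         */
        if (!is_shadow_present_pte(iter.old_spte) ||
            is_last_spte(iter.old_spte, iter.level))
                continue;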

>
> kvm_mmu_max_mapping_level() ensures that there are no write-tracked gfns anywhere
> in the huge range, so it's safe to propagate any and all WRITABLE bits.  This
> requires knowing/remembering that KVM disallows huge pages when a gfn is write-
> tracked, and relies on that never changing (which is a fairly safe bet, but the
> behavior isn't fully set in stone).
>
> And the presence of a shadow-present leaf SPTE ensures there are no in-flight
> mmu_notifier invalidations, as kvm_mmu_notifier_invalidate_range_start() won't
> return until all relevant leaf SPTEs have been zapped.

As you point out in the next paragraph, there could be an in-flight
invalidation that yielded, no?

>
> And even more subtly, recover_huge_pages_range() can install a huge SPTE while
> tdp_mmu_zap_leafs() is running, e.g. if tdp_mmu_zap_leafs() is processing 4KiB
> SPTEs because the greater 2MiB page is being unmapped by the primary MMU, and
> tdp_mmu_zap_leafs() yields.   That's again safe only because upon regaining
> control, tdp_mmu_zap_leafs() will restart at the root and thus observe and zap
> the new huge SPTE.

I agree it's subtle, but only in the sense that the TDP MMU is subtle.
Restarting at the root after dropping mmu_lock is a fundamental
concept in the TDP MMU.
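
For anyone following along: the restart is baked into the iterator
itself. From memory (so the exact code may differ), tdp_iter_next()
looks roughly like this; tdp_mmu_iter_cond_resched() sets
iter->yielded, and the next step restarts the walk from the root
instead of stepping down or sideways:

        /* Roughly tdp_iter_next(), paraphrased from memory. */
        void tdp_iter_next(struct tdp_iter *iter)
        {
                if (iter->yielded) {
                        /* Re-walk from the root, resuming at next_last_level_gfn. */
                        tdp_iter_restart(iter);
                        return;
                }

                if (try_step_down(iter))
                        return;

                do {
                        if (try_step_side(iter))
                                return;
                } while (try_step_up(iter));

                iter->valid = false;
        }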

>
> So while I'm pretty sure this logic is functionally ok, its safety is extremely
> dependent on a number of behaviors in KVM.
>
> That said, I can't tell which option I dislike less.  E.g. we could do something
> like this, where kvm_mmu_name_tbd() grabs the pfn+writable information from the
> primary MMU's PTE/PMD/PUD.  Ideally, KVM would use GUP, but then KVM wouldn't
> be able to create huge SPTEs for non-GUP-able memory, e.g. PFNMAP'd memory.
>
> I don't love this either, primarily because not using GUP makes this yet another
> custom flow

Yeah. I don't like having the huge page recovery path need its own
special way to construct SPTEs from scratch. E.g. I could see this
approach becoming a problem if KVM gains support for R/W/X GFN
attributes.

> i.e. isn't any less tricky than reusing a child SPTE.  It does have
> the advantage of not having to find a shadow-present child though, i.e. is likely
> the most performant option.  I agree that that likely doesn't matter in practice,
> especially since the raw latency of disabling dirty logging isn't terribly
> important.
>
>         rcu_read_lock();
>
>         for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
> retry:
>                 if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
>                         continue;
>
>                 if (iter.level > KVM_MAX_HUGEPAGE_LEVEL ||
>                     !is_shadow_present_pte(iter.old_spte))
>                         continue;
>
>                 /*
>                  * TODO: this should skip to the end of the parent, because if
>                  * the first SPTE can't be made huger, then no SPTEs at this
>                  * level can be huger.
>                  */
>                 if (is_last_spte(iter.old_spte, iter.level))
>                         continue;
>
>                 /*
>                  * If iter.gfn resides outside of the slot, i.e. the page for
>                  * the current level overlaps but is not contained by the slot,
>                  * then the SPTE can't be made huge.  More importantly, trying
>                  * to query that info from slot->arch.lpage_info will cause an
>                  * out-of-bounds access.
>                  */
>                 if (iter.gfn < start || iter.gfn >= end)
>                         continue;
>
>                 if (kvm_mmu_name_tbd(kvm, slot, iter.gfn, &pfn, &writable,
>                                      &max_mapping_level))
>                         continue;
>
>                 /*
>                  * If the range is being invalidated, do not install a SPTE as
>                  * KVM may have already zapped this specific gfn, e.g. if KVM's
>                  * unmapping has run to completion, but the primary MMU hasn't
>                  * zapped its PTEs.  There is no need to check for *past*
>                  * invalidations, because all information is gathered while
>                  * holding mmu_lock, i.e. it can't have become stale due to an
>                  * entire mmu_notifier invalidation sequence completing.
>                  */
>                 if (mmu_invalidate_retry_gfn(kvm, kvm->mmu_invalidate_seq, iter.gfn))
>                         continue;
>
>                 /*
>                  * KVM disallows huge pages for write-protected gfns, so it should
>                  * be impossible for make_spte() to encounter such a gfn since
>                  * write-protecting a gfn requires holding mmu_lock for write.
>                  */
>                 flush |= __tdp_mmu_map_gfn(...);
>                 WARN_ON_ONCE(r == RET_PF_EMULATE);
>         }
>
>         rcu_read_unlock();
>
> Assuming you don't like the above idea (I'm not sure I do), what if instead of
> doing the step-up, KVM starts a second iteration using the shadow page it wants
> to replace as the root of the walk?
>
> This has the same subtle dependencies on kvm_mmu_max_mapping_level() and the
> ordering with respect to an mmu_notifier invalidation, but it at least avoids
> having to reason about the correctness of re-reading a SPTE and modifying the
> iteration level within the body of an iteration loop.
>
> It should also yield smaller diffs overall, e.g. no revert, no separate commit
> to recheck the SPTE, etc.  And I believe it's more performant than the step-up
> approach when there are SPTEs that _can't_ be huge, as KVM won't traverse down
> into the leafs in that case.

This approach looks good to me. I'll try it out and see if I run into
any issues.

>
> An alternative to the tdp_mmu_iter_needs_reched() logic would be to pass in
> &flush, but I think that ends up being more confusing and harder to follow.

Yeah, I think that would be more complicated. If we drop mmu_lock then
we need to re-check kvm_mmu_max_mapping_level() and restart at the
root.
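
i.e. the ordering in your recover_huge_pages_range() sketch above is
what keeps the check valid: the resched (which restarts from the root
on yield) comes before kvm_mmu_max_mapping_level(), so the result is
never carried across an mmu_lock drop. Excerpting your sketch:

        for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
                /* Yield (and restart from the root) before any checks... */
                if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, true)) {
                        flush = false;
                        continue;
                }

                ...

                /* ...so this can't be stale due to a dropped mmu_lock. */
                max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot, iter.gfn);
                if (max_mapping_level < iter.level)
                        continue;
                ...
        }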

>
> static int tdp_mmu_make_huge_spte(struct kvm *kvm, struct tdp_iter *parent,
>                                   u64 *huge_spte)
> {
>         struct kvm_mmu_page *root = sptep_to_sp(parent->sptep);
>         gfn_t start = parent->gfn;
>         gfn_t end = start + ???; /* use parent's level */
>         struct tdp_iter iter;
>
>         tdp_root_for_each_leaf_pte(iter, root, start, end) {
>                 /*
>                  * Use the parent iterator when checking for forward progress,
>                  * so that KVM doesn't get stuck due to always yielding while
>                  * walking child SPTEs.
>                  */
>                 if (tdp_mmu_iter_needs_reched(kvm, parent))
>                         return -EAGAIN;
>
>                 *huge_spte = make_huge_spte(kvm, iter.old_spte);
>                 return 0;
>         }
>
>         return -ENOENT;
> }
>
> static void recover_huge_pages_range(struct kvm *kvm,
>                                      struct kvm_mmu_page *root,
>                                      const struct kvm_memory_slot *slot)
> {
>         gfn_t start = slot->base_gfn;
>         gfn_t end = start + slot->npages;
>         struct tdp_iter iter;
>         int r, max_mapping_level;
>         bool flush = false;
>         u64 huge_spte;
>
>         if (WARN_ON_ONCE(kvm_slot_dirty_track_enabled(slot)))
>                 return;
>
>         rcu_read_lock();
>
>         for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
> restart:
>                 if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, true)) {
>                         flush = false;
>                         continue;
>                 }
>
>                 if (iter.level > KVM_MAX_HUGEPAGE_LEVEL ||
>                     !is_shadow_present_pte(iter.old_spte))
>                         continue;
>
>                 /*
>                  * If iter.gfn resides outside of the slot, i.e. the page for
>                  * the current level overlaps but is not contained by the slot,
>                  * then the SPTE can't be made huge.  More importantly, trying
>                  * to query that info from slot->arch.lpage_info will cause an
>                  * out-of-bounds access.
>                  */
>                 if (iter.gfn < start || iter.gfn >= end)
>                         continue;
>
>                 max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot, iter.gfn);
>                 if (max_mapping_level < iter.level)
>                         continue;
>
>                 r = tdp_mmu_make_huge_spte(kvm, &iter, &huge_spte);
>                 if (r == -EAGAIN)
>                         goto restart;
>                 else if (r)
>                         continue;
>
>                 /*
>                  * If setting the SPTE fails, e.g. because it has been modified
>                  * by a different task, iteration will naturally continue with
>                  * the next SPTE.  Don't bother retrying this SPTE: races are
>                  * uncommon, and odds are good the SPTE was made huge or zapped
>                  * by the other task anyway.
>                  */
>                 if (!tdp_mmu_set_spte_atomic(kvm, &iter, huge_spte))
>                         flush = true;
>         }
>
>         if (flush)
>                 kvm_flush_remote_tlbs_memslot(kvm, slot);
>
>         rcu_read_unlock();
> }
>
> static inline bool tdp_mmu_iter_needs_reched(struct kvm *kvm,
>                                              struct tdp_iter *iter)
> {
>         /* Ensure forward progress has been made before yielding. */
>         return iter->next_last_level_gfn != iter->yielded_gfn &&
>                (need_resched() || rwlock_needbreak(&kvm->mmu_lock));
>
> }
>
> /*
>  * Yield if the MMU lock is contended or this thread needs to return control
>  * to the scheduler.
>  *
>  * If this function should yield and flush is set, it will perform a remote
>  * TLB flush before yielding.
>  *
>  * If this function yields, iter->yielded is set and the caller must skip to
>  * the next iteration, where tdp_iter_next() will reset the tdp_iter's walk
>  * over the paging structures to allow the iterator to continue its traversal
>  * from the paging structure root.
>  *
>  * Returns true if this function yielded.
>  */
> static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
>                                                           struct tdp_iter *iter,
>                                                           bool flush, bool shared)
> {
>         WARN_ON_ONCE(iter->yielded);
>
>         if (!tdp_mmu_iter_needs_reched(kvm, iter))
>                 return false;
>
>         if (flush)
>                 kvm_flush_remote_tlbs(kvm);
>
>         rcu_read_unlock();
>
>         if (shared)
>                 cond_resched_rwlock_read(&kvm->mmu_lock);
>         else
>                 cond_resched_rwlock_write(&kvm->mmu_lock);
>
>         rcu_read_lock();
>
>         WARN_ON_ONCE(iter->gfn > iter->next_last_level_gfn);
>
>         iter->yielded = true;
>         return true;
> }




