Re: [PATCH 7/7] KVM: x86/mmu: Recheck SPTE points to a PT during huge page recovery

On Mon, Aug 05, 2024, David Matlack wrote:
> Recheck the iter.old_spte still points to a page table when recovering
> huge pages. Since mmu_lock is held for read and tdp_iter_step_up()
> re-reads iter.sptep, it's possible the SPTE was zapped or recovered by
> another CPU in between stepping down and back up.
> 
> This avoids a useless cmpxchg (and possibly a remote TLB flush) if
> another CPU is recovering huge SPTEs in parallel (e.g. the NX huge page
> recovery worker, or vCPUs taking faults on the huge page region).
> 
> This also makes it clear that tdp_iter_step_up() re-reads the SPTE and
> thus can see a different value, which is not immediately obvious when
> reading the code.
> 
> Signed-off-by: David Matlack <dmatlack@xxxxxxxxxx>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 07d5363c9db7..bdc7fd476721 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1619,6 +1619,17 @@ static void recover_huge_pages_range(struct kvm *kvm,
>  		while (max_mapping_level > iter.level)
>  			tdp_iter_step_up(&iter);
>  
> +		/*
> +		 * Re-check that iter.old_spte still points to a page table.
> +		 * Since mmu_lock is held for read and tdp_iter_step_up()
> +		 * re-reads iter.sptep, it's possible the SPTE was zapped or
> +		 * recovered by another CPU in between stepping down and
> +		 * stepping back up.
> +		 */
> +		if (!is_shadow_present_pte(iter.old_spte) ||
> +		    is_last_spte(iter.old_spte, iter.level))
> +			continue;

This is the part of the step-up logic that I do not like.  Even this check doesn't
guarantee that the SPTE that is being replaced is the same non-leaf SPTE that was
used to reach the leaf SPTE.  E.g. in an absurdly theoretical situation, the SPTE
could be zapped and then re-set with another non-leaf SPTE.  Which is fine, but
only because of several very subtle mechanisms.
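
Concretely, the (absurdly theoretical) interleaving I'm worried about is
something like:

	CPU 0 (recovery)                        CPU 1
	----------------                        -----
	step down, read the leaf SPTE
	                                        zap the non-leaf SPTE that was
	                                        used to reach the leaf
	                                        install a *different* non-leaf
	                                        SPTE in its place
	tdp_iter_step_up() re-reads iter.sptep
	recheck sees a shadow-present, non-leaf
	SPTE and passes
	cmpxchg replaces a page table that was
	never actually walked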

kvm_mmu_max_mapping_level() ensures that there are no write-tracked gfns anywhere
in the huge range, so it's safe to propagate any and all WRITABLE bits.  This
requires knowing/remembering that KVM disallows huge pages when a gfn is write-
tracked, and relies on that never changing (which is a fairly safe bet, but the
behavior isn't fully set in stone).
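
For posterity, the dependency chain I'm relying on looks roughly like this
(paraphrased from memory, illustrative only, not the literal code): write-tracking
a gfn bumps disallow_lpage for every hugepage level that covers the gfn, e.g.
kvm_write_track_add_gfn() -> kvm_mmu_gfn_disallow_lpage(), and the max mapping
level query refuses to return any level with a non-zero count:

static int example_max_mapping_level(const struct kvm_memory_slot *slot,
				     gfn_t gfn)
{
	int level;

	for (level = KVM_MAX_HUGEPAGE_LEVEL; level > PG_LEVEL_4K; level--) {
		/* Elevated if, among other things, a covered gfn is write-tracked. */
		if (!lpage_info_slot(gfn, slot, level)->disallow_lpage)
			break;
	}
	return level;
}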

And the presence of a shadow-present leaf SPTE ensures there are no in-flight
mmu_notifier invalidations, as kvm_mmu_notifier_invalidate_range_start() won't
return until all relevant leaf SPTEs have been zapped.

And even more subtly, recover_huge_pages_range() can install a huge SPTE while
tdp_mmu_zap_leafs() is running, e.g. if tdp_mmu_zap_leafs() is processing 4KiB
SPTEs because the greater 2MiB page is being unmapped by the primary MMU, and
tdp_mmu_zap_leafs() yields.   That's again safe only because upon regaining
control, tdp_mmu_zap_leafs() will restart at the root and thus observe and zap
the new huge SPTE.
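
I.e. something like:

	recover_huge_pages_range()              tdp_mmu_zap_leafs()
	--------------------------              -------------------
	                                        zapping 4KiB SPTEs under the
	                                        2MiB region, yields mid-range
	sees a still-present 4KiB SPTE,
	installs a 2MiB SPTE over the
	page table
	                                        resumes, restarts the walk from
	                                        the root, sees the new 2MiB SPTE
	                                        (a leaf) and zaps it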

So while I'm pretty sure this logic is functionally ok, its safety is extremely
dependent on a number of behaviors in KVM.

That said, I can't tell which option I dislike less.  E.g. we could do something
like this, where kvm_mmu_name_tbd() grabs the pfn+writable information from the
primary MMU's PTE/PMD/PUD.  Ideally, KVM would use GUP, but then KVM wouldn't
be able to create huge SPTEs for non-GUP-able memory, e.g. PFNMAP'd memory.

I don't love this either, primarily because not using GUP makes this yet another
custom flow, i.e. isn't any less tricky than reusing a child SPTE.  It does have
the advantage of not having to find a shadow-present child though, i.e. is likely
the most performant option.  I agree that that likely doesn't matter in practice,
especially since the raw latency of disabling dirty logging isn't terribly
important.

	rcu_read_lock();

	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
retry:
		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
			continue;

		if (iter.level > KVM_MAX_HUGEPAGE_LEVEL ||
		    !is_shadow_present_pte(iter.old_spte))
			continue;

		/* 
		 * TODO: this should skip to the end of the parent, because if
		 * the first SPTE can't be made huger, then no SPTEs at this
		 * level can be huger.
		 */
		if (is_last_spte(iter.old_spte, iter.level))
			continue;

		/*
		 * If iter.gfn resides outside of the slot, i.e. the page for
		 * the current level overlaps but is not contained by the slot,
		 * then the SPTE can't be made huge.  More importantly, trying
		 * to query that info from slot->arch.lpage_info will cause an
		 * out-of-bounds access.
		 */
		if (iter.gfn < start || iter.gfn >= end)
			continue;

		if (kvm_mmu_name_tbd(kvm, slot, iter.gfn, &pfn, &writable,
				     &max_mapping_level))
			continue;

		/*
		 * If the range is being invalidated, do not install a SPTE as
		 * KVM may have already zapped this specific gfn, e.g. if KVM's
		 * unmapping has run to completion, but the primary MMU hasn't
		 * zapped its PTEs.  There is no need to check for *past*
		 * invalidations, because all information is gathered while
		 * holding mmu_lock, i.e. it can't have become stale due to an
		 * entire mmu_notifier invalidation sequence completing.
		 */
		if (mmu_invalidate_retry_gfn(kvm, kvm->mmu_invalidate_seq, iter.gfn))
			continue;

		/*
		 * KVM disallows huge pages for write-protected gfns, so it should
		 * be impossible for make_spte() to encounter such a gfn, since
		 * write-protecting a gfn requires holding mmu_lock for write.
		 */
		r = __tdp_mmu_map_gfn(...);
		WARN_ON_ONCE(r == RET_PF_EMULATE);
		flush |= (r == RET_PF_FIXED);
	}

	rcu_read_unlock();

Assuming you don't like the above idea (I'm not sure I do), what if instead of
doing the step-up, KVM starts a second iteration using the shadow page it wants
to replace as the root of the walk?

This has the same subtle dependencies on kvm_mmu_max_mapping_level() and the
ordering with respect to an mmu_notifier invalidation, but it at least avoids
having to reason about the correctness of re-reading a SPTE and modifying the
iteration level within the body of an iteration loop.

It should also yield smaller diffs overall, e.g. no revert, no separate commit
to recheck the SPTE, etc.  And I believe it's more performant than the step-up
approach when there are SPTEs that _can't_ be huge, as KVM won't traverse down
into the leafs in that case.

An alternative to the tdp_mmu_iter_needs_resched() logic would be to pass in
&flush, but I think that ends up being more confusing and harder to follow.

static int tdp_mmu_make_huge_spte(struct kvm *kvm, struct tdp_iter *parent,
				  u64 *huge_spte)
{
	struct kvm_mmu_page *root = sptep_to_sp(parent->sptep);
	gfn_t start = parent->gfn;
	gfn_t end = start + ???; /* use parent's level */
	struct tdp_iter iter;

	tdp_root_for_each_leaf_pte(iter, root, start, end) {
		/*
		 * Use the parent iterator when checking for forward progress,
		 * so that KVM doesn't get stuck due to always yielding while
		 * walking child SPTEs.
		 */
		if (tdp_mmu_iter_needs_resched(kvm, parent))
			return -EAGAIN;

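		/*
		 * Any shadow-present leaf child can serve as the template for
		 * the huge SPTE; per the above, kvm_mmu_max_mapping_level()
		 * has already guaranteed nothing in the parent's range, e.g. a
		 * write-tracked gfn, disallows a huge page.
		 */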
		*huge_spte = make_huge_spte(kvm, iter.old_spte);
		return 0;
	}

	return -ENOENT;
}
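
(For the end of the child walk, I assume the math would be something like

	gfn_t end = start + KVM_PAGES_PER_HPAGE(parent->level);

i.e. however KVM normally computes the number of pages covered by the parent's
level.)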

static void recover_huge_pages_range(struct kvm *kvm,
				     struct kvm_mmu_page *root,
				     const struct kvm_memory_slot *slot)
{
	gfn_t start = slot->base_gfn;
	gfn_t end = start + slot->npages;
	struct tdp_iter iter;
	int max_mapping_level, r;
	bool flush = false;
	u64 huge_spte;

	if (WARN_ON_ONCE(kvm_slot_dirty_track_enabled(slot)))
		return;

	rcu_read_lock();

	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
restart:
		if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, true)) {
			flush = false;
			continue;
		}

		if (iter.level > KVM_MAX_HUGEPAGE_LEVEL ||
		    !is_shadow_present_pte(iter.old_spte))
			continue;

		/*
		 * If iter.gfn resides outside of the slot, i.e. the page for
		 * the current level overlaps but is not contained by the slot,
		 * then the SPTE can't be made huge.  More importantly, trying
		 * to query that info from slot->arch.lpage_info will cause an
		 * out-of-bounds access.
		 */
		if (iter.gfn < start || iter.gfn >= end)
			continue;

		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot, iter.gfn);
		if (max_mapping_level < iter.level)
			continue;

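		/*
		 * -EAGAIN means the child walk detected mmu_lock contention or
		 * a pending resched; jump back so tdp_mmu_iter_cond_resched()
		 * performs the actual yield and restarts the outer walk.
		 */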
		r = tdp_mmu_make_huge_spte(kvm, &iter, &huge_spte);
		if (r == -EAGAIN)
			goto restart;
		else if (r)
			continue;

		/*
		 * If setting the SPTE fails, e.g. because it has been modified
		 * by a different task, iteration will naturally continue with
		 * the next SPTE.  Don't bother retrying this SPTE; races are
		 * uncommon and odds are good the other task either made the
		 * SPTE huge or zapped it, i.e. retrying would be pointless.
		 */
		if (!tdp_mmu_set_spte_atomic(kvm, &iter, huge_spte))
			flush = true;
	}

	if (flush)
		kvm_flush_remote_tlbs_memslot(kvm, slot);

	rcu_read_unlock();
}

static inline bool tdp_mmu_iter_needs_resched(struct kvm *kvm,
					      struct tdp_iter *iter)
{
	/* Ensure forward progress has been made before yielding. */
	return iter->next_last_level_gfn != iter->yielded_gfn &&
	       (need_resched() || rwlock_needbreak(&kvm->mmu_lock));
}

/*
 * Yield if the MMU lock is contended or this thread needs to return control
 * to the scheduler.
 *
 * If this function should yield and flush is set, it will perform a remote
 * TLB flush before yielding.
 *
 * If this function yields, iter->yielded is set and the caller must skip to
 * the next iteration, where tdp_iter_next() will reset the tdp_iter's walk
 * over the paging structures to allow the iterator to continue its traversal
 * from the paging structure root.
 *
 * Returns true if this function yielded.
 */
static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
							  struct tdp_iter *iter,
							  bool flush, bool shared)
{
	WARN_ON_ONCE(iter->yielded);

	if (!tdp_mmu_iter_needs_resched(kvm, iter))
		return false;

	if (flush)
		kvm_flush_remote_tlbs(kvm);

	rcu_read_unlock();

	if (shared)
		cond_resched_rwlock_read(&kvm->mmu_lock);
	else
		cond_resched_rwlock_write(&kvm->mmu_lock);

	rcu_read_lock();

	WARN_ON_ONCE(iter->gfn > iter->next_last_level_gfn);

	iter->yielded = true;
	return true;
}
