Re: [PATCH 2/2] KVM: x86/mmu: Recover NX Huge pages belonging to TDP MMU under MMU read lock

Sean Christopherson <seanjc@xxxxxxxxxx> · Mon, 19 Aug 2024 15:12:42 -0700

On Mon, Aug 19, 2024, Vipin Sharma wrote:
> On 2024-08-16 16:38:05, Sean Christopherson wrote:
> > On Mon, Aug 12, 2024, Vipin Sharma wrote:
> > > @@ -1831,12 +1845,17 @@ ulong kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm, ulong to_zap)
> > >  		 * recovered, along with all the other huge pages in the slot,
> > >  		 * when dirty logging is disabled.
> > >  		 */
> > > -		if (kvm_mmu_sp_dirty_logging_enabled(kvm, sp))
> > > +		if (kvm_mmu_sp_dirty_logging_enabled(kvm, sp)) {
> > > +			spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> > >  			unaccount_nx_huge_page(kvm, sp);
> > > -		else
> > > -			flush |= kvm_tdp_mmu_zap_sp(kvm, sp);
> > > -
> > > -		WARN_ON_ONCE(sp->nx_huge_page_disallowed);
> > > +			spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> > > +			to_zap--;
> > > +			WARN_ON_ONCE(sp->nx_huge_page_disallowed);
> > > +		} else if (tdp_mmu_zap_sp(kvm, sp)) {
> > > +			flush = true;
> > > +			to_zap--;
> > 
> > This is actively dangerous.  In the old code, tdp_mmu_zap_sp() could fail only
> > in a WARN-able scenario, i.e. practice was guaranteed to succeed.  And, the
> > for-loop *always* decremented to_zap, i.e. couldn't get stuck in an infinite
> > loop.
> > 
> > Neither of those protections exist in this version.  Obviously it shouldn't happen,
> > but it's possible this could flail on the same SP over and over, since nothing
> > guarnatees forward progress.  The cond_resched() would save KVM from true pain,
> > but it's still a wart in the implementation.
> > 
> > Rather than loop on to_zap, just do
> > 
> > 	list_for_each_entry(...) {
> > 
> > 		if (!to_zap)
> > 			break;
> > 	}
> > 
> > And if we don't use separate lists, that'll be an improvement too, as it KVM
> > will only have to skip "wrong" shadow pages once, whereas this approach means
> > every iteration of the loop has to walk past the "wrong" shadow pages.
> 
> TDP MMU accesses possible_nx_huge_pages using tdp_mmu_pages_lock spin
> lock. We cannot hold it for recovery duration.

Ah, right.  Hmm, we actually can.  More thoughts below.

> In this patch, tdp_mmu_zap_sp() has been modified to retry failures,
> which is similar to other retry mechanism in TDP MMU. Won't it be the
> same issue with other TDP MMU retry flows?

Similar, but not exactly the same.  The other flows are guarnateed to make forward
progress, as they'll never revisit a SPTE.  I.e. once a SPTE is observed to be
!shadow-present, that SPTE will never again be processed.

This is spinning on a pre-computed variable, and waiting until that many SPs have
been zapped.  The early break if the list is empty mostly protects against an
infinite loop, but it's theoretically possible other tasks could keep adding and
deleting from the list, in perpetuity.

Side topic, with the proposed changes, kvm_tdp_mmu_zap_sp() should return an int,
not a bool, i.e. 0/-errno, not true/false.  The existing code is specifically
returning whether or not a flush is needed, it does NOT indicate whether or not
the shadow page was successfully zapped, i.e. the !PRESENT case is treated as
being successful since something apparently already zapped the page.

[never mind, I've since come up with a better idea, but I typed  all this out,
 so I'm leaving it]

What about something like this?  If the shadow page can't be zapped because
something else was modifying it, just move on and deal with it next time.

	for ( ; to_zap; --to_zap) {
		...

		if (kvm_mmu_sp_dirty_logging_enabled(kvm, sp)) {
			spin_lock(&kvm->arch.tdp_mmu_pages_lock);
			unaccount_nx_huge_page(kvm, sp);
			spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
		} else if (!tdp_mmu_zap_sp(...)) {
			flush = true;
		} else {
			spin_lock(&kvm->arch.tdp_mmu_pages_lock);
			list_move_tail(&sp->possible_nx_huge_page_link, kvm->
				       &kvm->arch.possible_nx_huge_pages);
			spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
		}
	}

But jumping back to the "we actually can [hold tdp_mmu_pages_lock]", if the zap
is split into the actually CMPXCHG vs. handle_removed_pt() call, then the lock
can be held while walking+zapping.  And it's quite straightforward, if we're
willing to forego the sanity checks on the old_spte, which would require wrapping
the sp in a struct to create a tuple.

The only part that gives me pause is the fact that it's not super obvious that,
ignoring the tracepoint, handle_changed_spte() is just a fat wrapper for
handle_removed_pt() when zapping a SP.

Huh.  Actually, after a lot of fiddling and staring, there's a simpler solution,
and it would force us to comment/document an existing race that's subly ok.

For the dirty logging case, the result of kvm_mmu_sp_dirty_logging_enabled() is
visible to the NX recovery thread before the memslot update task is guaranteed
to finish (or even start) kvm_mmu_zap_collapsible_sptes().  I.e. KVM could
unaccount an NX shadow page before it is zapped, and that could lead to a vCPU
replacing the shadow page with an NX huge page.

Functionally, that's a-ok, because the accounting doesn't provide protection
against iTLB multi-hit bug, it's there purely to prevent KVM from bouncing a gfn
between an NX hugepage and an execute small page.  The only downside to the vCPU
doing the replacement is that the vCPU will get saddle with tearing down all the
child SPTEs.  But this should be a very rare race, so I can't imagine that would
be problematic in practice.

This contains some feedback I gathered along the, I'll respond to the original
patch since the context is lost.

static bool tdp_mmu_zap_possible_nx_huge_page(struct kvm *kvm,
					      struct kvm_mmu_page *sp)
{
	struct tdp_iter iter = {
		.old_spte = sp->ptep ? kvm_tdp_mmu_read_spte(sp->ptep) : 0,
		.sptep = sp->ptep,
		.level = sp->role.level + 1,
		.gfn = sp->gfn,
		.as_id = kvm_mmu_page_as_id(sp),
	};

	lockdep_assert_held_read(&kvm->mmu_lock);

	/*
	 * Root shadow pages don't a parent page table and thus no associated
	 * entry, but they can never be possible NX huge pages.
	 */
	if (WARN_ON_ONCE(!sp->ptep))
		return false;

	/*
	 * Since mmu_lock is held in read mode, it's possible another task has
	 * already modified the SPTE.  Zap the SPTE if and only if the SPTE
	 * points at the SP's page table, as checking  shadow-present isn't
	 * sufficient, e.g. the SPTE could be replaced by a leaf SPTE, or even
	 * another SP.  Note, spte_to_child_pt() also checks that the SPTE is
	 * shadow-present, i.e. guards against zapping a frozen SPTE.
	 */
	if (sp->spt != spte_to_child_pt(iter.old_spte, iter.level))
		return false;

	/*
	 * If a different task modified the SPTE, then it should be impossible
	 * for the SPTE to still be used for the to-be-zapped SP.  Non-leaf
	 * SPTEs don't have Dirty bits, KVM always sets the Accessed bit when
	 * creating non-leaf SPTEs, and all other bits are immutable for non-
	 * leaf SPTEs, i.e. the only legal operations for non-leaf SPTEs are
	 * zapping and replacement.
	 */
	if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE)) {
		WARN_ON_ONCE(sp->spt == spte_to_child_pt(iter.old_spte, iter.level));
		return false;
	}

	return true;
}

void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm, unsigned long to_zap)
{
	struct kvm_mmu_page *sp;
	bool flush = false;

	lockdep_assert_held_read(&kvm->mmu_lock);
	/*
	 * Zapping TDP MMU shadow pages, including the remote TLB flush, must
	 * be done under RCU protection, because the pages are freed via RCU
	 * callback.
	 */
	rcu_read_lock();

	for ( ; to_zap; --to_zap) {
		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
		if (list_empty(&kvm->arch.possible_tdp_mmu_nx_huge_pages))
			break;

		sp = list_first_entry(&kvm->arch.possible_tdp_mmu_nx_huge_pages,
				      struct kvm_mmu_page,
				      possible_nx_huge_page_link);

		WARN_ON_ONCE(!sp->nx_huge_page_disallowed);
		WARN_ON_ONCE(!sp->role.direct);

		/*
		 * Unaccount the shadow page before zapping its SPTE so as to
		 * avoid bouncing tdp_mmu_pages_lock() more than is necessary.
		 * Clearing nx_huge_page_disallowed before zapping is safe, as
		 * the flag doesn't protect against iTLB multi-hit, it's there
		 * purely to prevent bouncing the gfn between an NX huge page
		 * and an X small spage.  A vCPU could get stuck tearing down
		 * the shadow page, e.g. if it happens to fault on the region
		 * before the SPTE is zapped and replaces the shadow page with
		 * an NX huge page and get stuck tearing down the child SPTEs,
		 * but that is a rare race, i.e. shouldn't impact performance.
		 */
		unaccount_nx_huge_page(kvm, sp);

		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);

		/*
		 * Don't bother zapping shadow pages if the memslot is being
		 * dirty logged, as the relevant pages would just be faulted
		 * back in as 4KiB pages.  Potential NX Huge Pages in this slot
		 * will be recovered, along with all the other huge pages in
		 * the slot, when dirty logging is disabled.
		 */
		if (!kvm_mmu_sp_dirty_logging_enabled(kvm, sp))
			flush |= tdp_mmu_zap_possible_nx_huge_page(kvm, sp);

		WARN_ON_ONCE(sp->nx_huge_page_disallowed);

		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
			if (flush) {
				kvm_flush_remote_tlbs(kvm);
				flush = false;
			}
			rcu_read_unlock();
			cond_resched_rwlock_read(&kvm->mmu_lock);
			rcu_read_lock();
		}
	}

	if (flush)
		kvm_flush_remote_tlbs(kvm);
	rcu_read_unlock();
}