Re: [PATCH 2/2] KVM: x86/mmu: Recover NX Huge pages belonging to TDP MMU under MMU read lock

Vipin Sharma <vipinsh@xxxxxxxxxx> · Fri, 23 Aug 2024 15:38:00 -0700

On 2024-08-19 15:12:42, Sean Christopherson wrote:
> On Mon, Aug 19, 2024, Vipin Sharma wrote:
> > On 2024-08-16 16:38:05, Sean Christopherson wrote:
> > In this patch, tdp_mmu_zap_sp() has been modified to retry failures,
> > which is similar to other retry mechanism in TDP MMU. Won't it be the
> > same issue with other TDP MMU retry flows?
> 
> Similar, but not exactly the same.  The other flows are guarnateed to make forward
> progress, as they'll never revisit a SPTE.  I.e. once a SPTE is observed to be
> !shadow-present, that SPTE will never again be processed.
> 
> This is spinning on a pre-computed variable, and waiting until that many SPs have
> been zapped.  The early break if the list is empty mostly protects against an
> infinite loop, but it's theoretically possible other tasks could keep adding and
> deleting from the list, in perpetuity.

Got it.

> What about something like this?  If the shadow page can't be zapped because
> something else was modifying it, just move on and deal with it next time.

This sounds good. I was trying to force zapping "to_zap" times
shadow pages to make it similar to existing NX huge page recovery
approach.

> But jumping back to the "we actually can [hold tdp_mmu_pages_lock]", if the zap
> is split into the actually CMPXCHG vs. handle_removed_pt() call, then the lock
> can be held while walking+zapping.  And it's quite straightforward, if we're
> willing to forego the sanity checks on the old_spte, which would require wrapping
> the sp in a struct to create a tuple.
> 
> The only part that gives me pause is the fact that it's not super obvious that,
> ignoring the tracepoint, handle_changed_spte() is just a fat wrapper for
> handle_removed_pt() when zapping a SP.
> 
> Huh.  Actually, after a lot of fiddling and staring, there's a simpler solution,
> and it would force us to comment/document an existing race that's subly ok.
> 
> For the dirty logging case, the result of kvm_mmu_sp_dirty_logging_enabled() is
> visible to the NX recovery thread before the memslot update task is guaranteed
> to finish (or even start) kvm_mmu_zap_collapsible_sptes().  I.e. KVM could
> unaccount an NX shadow page before it is zapped, and that could lead to a vCPU
> replacing the shadow page with an NX huge page.
> 
> Functionally, that's a-ok, because the accounting doesn't provide protection
> against iTLB multi-hit bug, it's there purely to prevent KVM from bouncing a gfn
> between an NX hugepage and an execute small page.  The only downside to the vCPU
> doing the replacement is that the vCPU will get saddle with tearing down all the
> child SPTEs.  But this should be a very rare race, so I can't imagine that would
> be problematic in practice.

I am worried that whenever this happens it might cause guest jitter
which we are trying to avoid as handle_changed_spte() might be keep a
vCPU busy for sometime.

> static bool tdp_mmu_zap_possible_nx_huge_page(struct kvm *kvm,
> 					      struct kvm_mmu_page *sp)
> {
> 	/*
> 	 * If a different task modified the SPTE, then it should be impossible
> 	 * for the SPTE to still be used for the to-be-zapped SP.  Non-leaf
> 	 * SPTEs don't have Dirty bits, KVM always sets the Accessed bit when
> 	 * creating non-leaf SPTEs, and all other bits are immutable for non-
> 	 * leaf SPTEs, i.e. the only legal operations for non-leaf SPTEs are
> 	 * zapping and replacement.
> 	 */
> 	if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE)) {
> 		WARN_ON_ONCE(sp->spt == spte_to_child_pt(iter.old_spte, iter.level));

I responded in another patch before reading all this here. This looks
good.

> void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm, unsigned long to_zap)
> 
> 		/*
> 		 * Unaccount the shadow page before zapping its SPTE so as to
> 		 * avoid bouncing tdp_mmu_pages_lock() more than is necessary.
> 		 * Clearing nx_huge_page_disallowed before zapping is safe, as
> 		 * the flag doesn't protect against iTLB multi-hit, it's there
> 		 * purely to prevent bouncing the gfn between an NX huge page
> 		 * and an X small spage.  A vCPU could get stuck tearing down
> 		 * the shadow page, e.g. if it happens to fault on the region
> 		 * before the SPTE is zapped and replaces the shadow page with
> 		 * an NX huge page and get stuck tearing down the child SPTEs,
> 		 * but that is a rare race, i.e. shouldn't impact performance.
> 		 */
> 		unaccount_nx_huge_page(kvm, sp);

Might cause jitter. A long jitter might cause an escalation.

What if I do not unaccount in the beginning, and  move page to the end
of the list only if it is still in the list? If zapping failed because
some other flow might be removing this page but it still in the
possible_nx_huge_pages list, then just move it to the end. The thread
which is removing will remove it from the list eventually.

for ( ; to_zap; --to_zap) {
	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
	if (list_empty(&kvm->arch.possible_tdp_mmu_nx_huge_pages)) {
		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
		break;
	}

	sp = list_first_entry(&kvm->arch.possible_tdp_mmu_nx_huge_pages,
			      struct kvm_mmu_page,
			      possible_nx_huge_page_link);

	WARN_ON_ONCE(!sp->nx_huge_page_disallowed);
	WARN_ON_ONCE(!sp->role.direct);

	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);

	/*
	 * Don't bother zapping shadow pages if the memslot is being
	 * dirty logged, as the relevant pages would just be faulted
	 * back in as 4KiB pages.  Potential NX Huge Pages in this slot
	 * will be recovered, along with all the other huge pages in
	 * the slot, when dirty logging is disabled.
	 */
	if (kvm_mmu_sp_dirty_logging_enabled(kvm, sp)) {
		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
		unaccount_nx_huge_page(kvm, sp);
		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
		WARN_ON_ONCE(sp->nx_huge_page_disallowed);
	} else if (tdp_mmu_zap_possible_nx_huge_page(kvm, sp)) {
		flush = true;
		WARN_ON_ONCE(sp->nx_huge_page_disallowed);
	} else {
		/*
		 * Try again in future if the page is still in the
		 * list
		 */
		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
		if (!list_empty(&sp->possible_nx_huge_page_link))
			list_move_tail(&sp->possible_nx_huge_page_link,
			kvm-> &kvm->arch.possible_nx_huge_pages);
		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
	}

	/* Resched code below */
}