Re: [PATCH 2/2] KVM: x86/mmu: Recover NX Huge pages belonging to TDP MMU under MMU read lock

Sean Christopherson <seanjc@xxxxxxxxxx> · Mon, 26 Aug 2024 07:34:35 -0700

On Fri, Aug 23, 2024, Vipin Sharma wrote:
> On 2024-08-19 15:12:42, Sean Christopherson wrote:
> > On Mon, Aug 19, 2024, Vipin Sharma wrote:
> > Huh.  Actually, after a lot of fiddling and staring, there's a simpler solution,
> > and it would force us to comment/document an existing race that's subly ok.
> > 
> > For the dirty logging case, the result of kvm_mmu_sp_dirty_logging_enabled() is
> > visible to the NX recovery thread before the memslot update task is guaranteed
> > to finish (or even start) kvm_mmu_zap_collapsible_sptes().  I.e. KVM could
> > unaccount an NX shadow page before it is zapped, and that could lead to a vCPU
> > replacing the shadow page with an NX huge page.
> > 
> > Functionally, that's a-ok, because the accounting doesn't provide protection
> > against iTLB multi-hit bug, it's there purely to prevent KVM from bouncing a gfn
> > between an NX hugepage and an execute small page.  The only downside to the vCPU
> > doing the replacement is that the vCPU will get saddle with tearing down all the
> > child SPTEs.  But this should be a very rare race, so I can't imagine that would
> > be problematic in practice.
> 
> I am worried that whenever this happens it might cause guest jitter
> which we are trying to avoid as handle_changed_spte() might be keep a
> vCPU busy for sometime.

That race already exists today, and your series already extends the ways in which
the race can be hit.  My suggestion is to (a) explicit document that race and (b)
expand the window in which it can occur to also apply to dirty logging being off.

> > void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm, unsigned long to_zap)
> > 
> > 		/*
> > 		 * Unaccount the shadow page before zapping its SPTE so as to
> > 		 * avoid bouncing tdp_mmu_pages_lock() more than is necessary.
> > 		 * Clearing nx_huge_page_disallowed before zapping is safe, as
> > 		 * the flag doesn't protect against iTLB multi-hit, it's there
> > 		 * purely to prevent bouncing the gfn between an NX huge page
> > 		 * and an X small spage.  A vCPU could get stuck tearing down
> > 		 * the shadow page, e.g. if it happens to fault on the region
> > 		 * before the SPTE is zapped and replaces the shadow page with
> > 		 * an NX huge page and get stuck tearing down the child SPTEs,
> > 		 * but that is a rare race, i.e. shouldn't impact performance.
> > 		 */
> > 		unaccount_nx_huge_page(kvm, sp);
> 
> Might cause jitter. A long jitter might cause an escalation.
> 
> What if I do not unaccount in the beginning, and  move page to the end
> of the list only if it is still in the list?

The race between kvm_mmu_sp_dirty_logging_enabled() vs. kvm_tdp_mmu_map() vs.
kvm_mmu_zap_collapsible_sptes() still exists.

> If zapping failed because some other flow might be removing this page but it
> still in the possible_nx_huge_pages list, then just move it to the end. The
> thread which is removing will remove it from the list eventually.
> 
> for ( ; to_zap; --to_zap) {
> 	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> 	if (list_empty(&kvm->arch.possible_tdp_mmu_nx_huge_pages)) {
> 		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> 		break;
> 	}
> 
> 	sp = list_first_entry(&kvm->arch.possible_tdp_mmu_nx_huge_pages,
> 			      struct kvm_mmu_page,
> 			      possible_nx_huge_page_link);
> 
> 	WARN_ON_ONCE(!sp->nx_huge_page_disallowed);
> 	WARN_ON_ONCE(!sp->role.direct);
> 
> 	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> 
> 
> 	/*
> 	 * Don't bother zapping shadow pages if the memslot is being
> 	 * dirty logged, as the relevant pages would just be faulted
> 	 * back in as 4KiB pages.  Potential NX Huge Pages in this slot
> 	 * will be recovered, along with all the other huge pages in
> 	 * the slot, when dirty logging is disabled.
> 	 */
> 	if (kvm_mmu_sp_dirty_logging_enabled(kvm, sp)) {
> 		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> 		unaccount_nx_huge_page(kvm, sp);
> 		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> 		WARN_ON_ONCE(sp->nx_huge_page_disallowed);
> 	} else if (tdp_mmu_zap_possible_nx_huge_page(kvm, sp)) {
> 		flush = true;
> 		WARN_ON_ONCE(sp->nx_huge_page_disallowed);
> 	} else {
> 		/*
> 		 * Try again in future if the page is still in the
> 		 * list
> 		 */
> 		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> 		if (!list_empty(&sp->possible_nx_huge_page_link))
> 			list_move_tail(&sp->possible_nx_huge_page_link,
> 			kvm-> &kvm->arch.possible_nx_huge_pages);

This is unsafe.  The only thing that prevents a use-after-free of "sp" is the fact
that this task holds rcu_read_lock().  The sp could already been queued for freeing
via call_rcu().

> 		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> 	}
> 
> 	/* Resched code below */
> }