Re: [PATCH 2/2] KVM: x86/mmu: Recover NX Huge pages belonging to TDP MMU under MMU read lock

Sean Christopherson <seanjc@xxxxxxxxxx> · Fri, 23 Aug 2024 13:45:40 -0700

On Fri, Aug 23, 2024, Vipin Sharma wrote:
> On 2024-08-19 15:19:19, Sean Christopherson wrote:
> > On Mon, Aug 12, 2024, Vipin Sharma wrote:
> > > +retry:
> > > +	/*
> > > +	 * Since mmu_lock is held in read mode, it's possible to race with
> > > +	 * another CPU which can remove sp from the page table hierarchy.
> > > +	 *
> > > +	 * No need to re-read iter.old_spte as tdp_mmu_set_spte_atomic() will
> > > +	 * update it in the case of failure.
> > > +	 */
> > > +	if (sp->spt != spte_to_child_pt(iter.old_spte, iter.level))
> > >  		return false;
> > >  
> > > -	tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte,
> > > -			 SHADOW_NONPRESENT_VALUE, sp->gfn, sp->role.level + 1);
> > > +	if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE))
> > > +		goto retry;
> > 
> > I'm pretty sure there's no need to retry.  Non-leaf SPTEs don't have Dirty bits,
> > and KVM always sets the Accessed bit (and never clears it) for non-leaf SPTEs.
> > Ditty for the Writable bit.
> > 
> > So unless I'm missing something, the only way for the CMPXCHG to fail is if the
> > SPTE was zapped or replaced with something else, in which case the sp->spt will
> > fail.  I would much prefer to WARN on that logic failing than have what appears
> > to be a potential infinite loop.
> 
> I don't think we should WARN() in that scenario. Because there is
> nothing wrong with someone racing with NX huge page recovery for zapping
> or replacing the SPTE.

Agreed, but as sketched out in my other reply[*], I'm saying KVM should WARN like so:

	/*
	 * If a different task modified the SPTE, then it should be impossible
	 * for the SPTE to still be used for the to-be-zapped SP.  Non-leaf
	 * SPTEs don't have Dirty bits, KVM always sets the Accessed bit when
	 * creating non-leaf SPTEs, and all other bits are immutable for non-
	 * leaf SPTEs, i.e. the only legal operations for non-leaf SPTEs are
	 * zapping and replacement.
	 */
	if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE)) {
		WARN_ON_ONCE(sp->spt == spte_to_child_pt(iter.old_spte, iter.level));
		return false;
	}

[*] https://lore.kernel.org/all/ZsPDWqOiv_g7Wh_H@xxxxxxxxxx

> This function should be just trying to zap a page

Yes and no.  KVM, very deliberately, has multiple different flows for zapping
SPTEs, because the requirements, rules, and performance needs vary.

Initially, the TDP MMU had one common flow for zapping, and it was difficult to
maintain and understand, because trying to optimize and adjust for the various
scenarios in a single path was difficult and convoluted.  E.g. when zapping for
mmu_notifier invalidations, mmu_lock must be held for read, a flush is required,
KVM _must_ process stale/obsolete roots, and KVM only needs to zap leaf SPTEs.

Constrast that with zapping SPTEs in the back half of a fast zap, e.g. after a
memslot deletion, which runs with mmu_lock held for read, only processes invalid
roots, doesn't need to flush, but does need to zap _everything_.

This API is no different.  It is very specifically for zapping non-leaf, non-root
SPTEs in live roots.

So "yes", it's "just" trying to zap a page, but "no" in the sense that there are
a lot of rules and requirements behind that simple statement.

> and if that didn't suceed then return the error and let caller handle however
> they want to.  NX huge page recovery should be tolerant of this zapping
> failure and move on to the next shadow page.

Again, I agree, but that is orthogonal to the WARN I am suggesting.  The intent
of the WARN isn't to detect that zapping failed, it's to flag that the impossible
has happened.

> May be we can put WARN if NX huge page recovery couldn't zap any pages during
> its run time. For example, if it was supposed to zap 10 pages and it couldn't
> zap any of them then print using WARN_ON_ONCE.

Why?  It's improbable, but absolutely possible that zapping 10 SPTEs could fail.
Would it make sense to publish a stat so that userspace can alert on NX huge page
recovery not being as effective as it should be?  Maybe.  But the kernel should
never WARN because an unlikely scenario occurred.

If you look at all the KVM_MMU_WARN_ON() and (hopefully) WARN_ON_ONCE() calls in
KVM x86, they're either for things that should never happen absent software or
hardware bugs, or to ensure future changes update all relevant code paths.

The WARN I am suggesting is the same.  It can only fire if there's a hardware
bug, a KVM bug, or if KVM changes how it behaves, e.g. starts doing A/D tracking
on non-leaf SPTEs.

> This is with the assumption that if more than 1 pages are there to be zapped
> then at least some of them will get zapped whenever recovery worker kicks in.