Re: [PATCH 2/2] KVM: x86/mmu: Recover NX Huge pages belonging to TDP MMU under MMU read lock

On 2024-08-19 15:19:19, Sean Christopherson wrote:
> On Mon, Aug 12, 2024, Vipin Sharma wrote:
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -817,9 +817,11 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> >  	rcu_read_unlock();
> >  }
> >  
> > -bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> > +static bool tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> 
> At this point, I think we should rename this to tdp_mmu_zap_possible_nx_huge_page(),
> as I can't imagine there's another use case where we'll zap a SP starting from the
> SP itself, i.e. without first walking from the root.
> 

Okay.

> >  {
> > -	u64 old_spte;
> > +	struct tdp_iter iter = {};
> 
> Rather than zero-initializing the on-stack structure, I think it makes sense to
> directly initialize the whole thing and then WARN after, e.g. so that it's easier
> to see that "iter" is simply being filled from @sp.
> 
> 	struct tdp_iter iter = {
> 		.old_spte = sp->ptep ? kvm_tdp_mmu_read_spte(sp->ptep) : 0,
> 		.sptep = sp->ptep,
> 		.level = sp->role.level + 1,
> 		.gfn = sp->gfn,
> 		.as_id = kvm_mmu_page_as_id(sp),
> 	};

Okay.
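
Reading "WARN after" as something like this right below the
initializer (my assumption of the check you have in mind):

	/*
	 * Assuming this is the WARN in question: a page with no parent
	 * SPTE is a root, which should never be on the possible NX huge
	 * page list.
	 */
	if (WARN_ON_ONCE(!sp->ptep))
		return false;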

> > +retry:
> > +	/*
> > +	 * Since mmu_lock is held in read mode, it's possible to race with
> > +	 * another CPU which can remove sp from the page table hierarchy.
> > +	 *
> > +	 * No need to re-read iter.old_spte as tdp_mmu_set_spte_atomic() will
> > +	 * update it in the case of failure.
> > +	 */
> > +	if (sp->spt != spte_to_child_pt(iter.old_spte, iter.level))
> >  		return false;
> >  
> > -	tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte,
> > -			 SHADOW_NONPRESENT_VALUE, sp->gfn, sp->role.level + 1);
> > +	if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE))
> > +		goto retry;
> 
> I'm pretty sure there's no need to retry.  Non-leaf SPTEs don't have Dirty bits,
> and KVM always sets the Accessed bit (and never clears it) for non-leaf SPTEs.
> Ditto for the Writable bit.
> 
> So unless I'm missing something, the only way for the CMPXCHG to fail is if the
> SPTE was zapped or replaced with something else, in which case the sp->spt
> comparison will fail.  I would much prefer to WARN on that logic failing than
> have what appears to be a potential infinite loop.

I don't think we should WARN() in that scenario. There is nothing wrong
with another task racing with NX huge page recovery by zapping or
replacing the SPTE. This function should just try to zap a page, and if
that doesn't succeed, return the failure and let the caller handle it
however it wants to.
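
Something like this for the tail of the function (untested sketch,
using the tdp_mmu_zap_possible_nx_huge_page() name from above):

	if (sp->spt != spte_to_child_pt(iter.old_spte, iter.level))
		return false;

	/*
	 * A failed cmpxchg means some other task zapped or replaced the
	 * SPTE, so report failure to the caller instead of retrying.
	 */
	if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE))
		return false;

	return true;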

NX huge page recovery should be tolerant of this zapping failure and
move on to the next shadow page. Maybe we can WARN if NX huge page
recovery couldn't zap any pages during its run: for example, if it was
supposed to zap 10 pages and couldn't zap a single one of them, then
print via WARN_ON_ONCE. This is under the assumption that if more than
one page is due to be zapped, then at least some of them will get
zapped whenever the recovery worker kicks in.
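
Rough sketch of what I mean on the recovery side, where
next_nx_huge_page() is a made-up placeholder for however the loop picks
its next victim:

	unsigned long zapped = 0, attempted = 0;

	while (to_zap--) {
		/* Hypothetical helper, purely for illustration. */
		sp = next_nx_huge_page(kvm);
		if (!sp)
			break;

		attempted++;
		if (tdp_mmu_zap_possible_nx_huge_page(kvm, sp))
			zapped++;
	}

	/*
	 * Losing the race on every single attempt would be suspicious,
	 * so flag it once instead of failing silently.
	 */
	WARN_ON_ONCE(attempted && !zapped);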
