Re: [PATCH] KVM: x86/mmu: fix KVM_X86_QUIRK_SLOT_ZAP_ALL for shadow MMU

Yan Zhao <yan.y.zhao@xxxxxxxxx> · Wed, 9 Oct 2024 16:51:28 +0800




On Fri, Oct 04, 2024 at 09:56:07AM -0700, Sean Christopherson wrote:
> On Thu, Oct 03, 2024, Paolo Bonzini wrote:
> > As was tried in commit 4e103134b862 ("KVM: x86/mmu: Zap only the relevant
> > pages when removing a memslot"), all shadow pages, i.e. non-leaf SPTEs,
> > need to be zapped.  All of the accounting for a shadow page is tied to the
> > memslot, i.e. the shadow page holds a reference to the memslot, for all
> > intents and purposes.  Deleting the memslot without removing all relevant
> > shadow pages, as is done when KVM_X86_QUIRK_SLOT_ZAP_ALL is disabled,
> > results in NULL pointer derefs when tearing down the VM.
> > 
> > Reintroduce from that commit the code that walks the whole memslot when
> > there are active shadow MMU pages.
> > 
> > Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
Thanks, and sorry for the trouble; I didn't test with EPT disabled.

> > ---
> > 	In the end I did opt for zapping all the pages.  I don't see a
> > 	reason to let them linger forever in the hash table.
> > 
> > 	A small optimization would be to only check each bucket once,
> > 	which would require a bitmap sized according to the number of
> > 	buckets.  I'm not going to bother though, at least for now.
> > 
> >  arch/x86/kvm/mmu/mmu.c | 60 ++++++++++++++++++++++++++++++++----------
> >  1 file changed, 46 insertions(+), 14 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index e081f785fb23..912bad4fa88c 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1884,10 +1884,14 @@ static bool sp_has_gptes(struct kvm_mmu_page *sp)
> >  		if (is_obsolete_sp((_kvm), (_sp))) {			\
> >  		} else
> >  
> > -#define for_each_gfn_valid_sp_with_gptes(_kvm, _sp, _gfn)		\
> > +#define for_each_gfn_valid_sp(_kvm, _sp, _gfn)				\
> >  	for_each_valid_sp(_kvm, _sp,					\
> >  	  &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)])	\
> > -		if ((_sp)->gfn != (_gfn) || !sp_has_gptes(_sp)) {} else
> > +		if ((_sp)->gfn != (_gfn)) {} else
> 
> I don't think we should provide this iterator, because it won't do what most people
> would expect it to do.  Specifically, the "round gfn for level" adjustment that
> is done for direct SPs means that the exact gfn comparison will not get a match,
> even when a SP does "cover" a gfn, or was even created specifically for a gfn.
Right, zapping SPs with no gPTEs is not necessary.
When role.direct is true, sp->gfn can even be a non-slot gfn, with the leaf
entries being MMIO SPTEs. So it should be OK to ignore SPs that match
"!sp_has_gptes(_sp) && (_sp)->gfn == (_gfn)".
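
To illustrate the rounding point, here is a minimal userspace sketch, not kernel
code. It assumes 4KiB base pages and 9 address bits per paging level, and that a
direct SP's gfn is rounded down to the start of the range the SP spans. The
helpers sp_span_gfns(), direct_sp_base_gfn() and exact_gfn_compare_matches() are
hypothetical names made up for this example.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Assumption: 4KiB base pages and 9 address bits per paging level. */
#define BITS_PER_LEVEL 9

/*
 * Hypothetical helper: number of gfns spanned by a direct SP at sp_level,
 * where level 1 is a page table holding 512 4KiB entries.
 */
static gfn_t sp_span_gfns(int sp_level)
{
	assert(sp_level >= 1 && sp_level <= 4);
	return (gfn_t)1 << (sp_level * BITS_PER_LEVEL);
}

/* Hypothetical helper: base gfn of the direct SP at sp_level covering gfn. */
static gfn_t direct_sp_base_gfn(gfn_t gfn, int sp_level)
{
	return gfn & ~(sp_span_gfns(sp_level) - 1);
}

/* Would an exact "sp->gfn == gfn" comparison find the covering SP? */
static int exact_gfn_compare_matches(gfn_t gfn, int sp_level)
{
	return direct_sp_base_gfn(gfn, sp_level) == gfn;
}
```

Under these assumptions, gfn 0x12345 is covered by a level-1 SP whose base gfn
is 0x12200, so a lookup keyed on the exact gfn misses it, while a gfn that
happens to sit on the SP's boundary does match.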

Tests of "normal VM + nested VM + 3 selftests" passed on the following 3 configs,
with the quirk disabled and the change below applied:
1) modprobe kvm_intel ept=0
2) modprobe kvm tdp_mmu=0
   modprobe kvm_intel ept=1
3) modprobe kvm tdp_mmu=1
   modprobe kvm_intel ept=1

@@ -7071,7 +7077,7 @@ static void kvm_mmu_zap_memslot_pages_and_flush(struct kvm *kvm,
                struct kvm_mmu_page *sp;
                gfn_t gfn = slot->base_gfn + i;

-               for_each_gfn_valid_sp(kvm, sp, gfn)
+               for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn)
                        kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);

                if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {


> For this usage specifically, KVM's behavior will vary significantly based on the
> size and alignment of a memslot, and in weird ways.  E.g. For a 4KiB memslot,
> KVM will zap more SPs if the slot is 1GiB aligned than if it's only 4KiB aligned.
> And as described below, zapping SPs in the aligned case would overzap for direct
> MMUs, as odds are good the upper-level SPs are serving other memslots.
> 
> To iterate over all potentially-relevant gfns, KVM would need to make a pass over
> the hash table for each level, with the gfn used for lookup rounded for said level.
> And then check that the SP is of the correct level, too, e.g. to avoid over-zapping.
> 
> But even then, KVM would massively overzap, as processing every level is all but
> guaranteed to zap SPs that serve other memslots, especially if the memslot being
> removed is relatively small.  We could mitigate that by processing only levels
> that can be possible guest huge pages, but while somewhat logical, that's quite
> arbitrary and would be a bit of a mess to implement.
> 
> So, despite my initial reservations about zapping only SPs with gPTEs, I feel
> quite strongly that that's the best approach.  It's easy to describe, is predictable,
> and is explicitly minimal, i.e. KVM only zaps SPs that absolutely must be zapped.
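
The alignment dependence and the overzap concern described above can be sketched
in userspace as follows. This is only an illustration under stated assumptions:
4KiB base pages, 9 address bits per level, direct SPs at levels 1..3 only (the
root is left out), and SP gfns rounded down to the SP's span. The names
span_gfns(), levels_with_exact_gfn_match() and gfns_outside_slot() are
hypothetical and do not exist in KVM.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Assumptions: 9 address bits per level; only SP levels 1..3 are considered. */
#define BITS_PER_LEVEL 9
#define MAX_SP_LEVEL   3

/* Hypothetical helper: gfns spanned by a direct SP at sp_level. */
static gfn_t span_gfns(int sp_level)
{
	assert(sp_level >= 1 && sp_level <= MAX_SP_LEVEL);
	return (gfn_t)1 << (sp_level * BITS_PER_LEVEL);
}

/*
 * Count the SP levels at which a direct SP's rounded gfn could equal gfn
 * exactly, i.e. the levels an exact-gfn hash lookup could hit.
 */
static int levels_with_exact_gfn_match(gfn_t gfn)
{
	int level, n = 0;

	for (level = 1; level <= MAX_SP_LEVEL; level++)
		if ((gfn & (span_gfns(level) - 1)) == 0)
			n++;
	return n;
}

/*
 * For the SP at sp_level that covers the slot's first gfn, count the gfns
 * the SP spans that lie outside the slot [base, base + npages).
 */
static gfn_t gfns_outside_slot(gfn_t base, gfn_t npages, int sp_level)
{
	gfn_t span = span_gfns(sp_level);
	gfn_t sp_end = (base & ~(span - 1)) + span;
	gfn_t slot_end = base + npages;
	gfn_t overlap_end = slot_end < sp_end ? slot_end : sp_end;

	return span - (overlap_end - base);
}
```

With these assumptions, a one-page slot at a 1GiB-aligned gfn (0x40000) can hit
SPs at two levels via an exact-gfn lookup, while the same slot at gfn 0x40001
hits none. A level-2 SP covering that one page also spans 262143 gfns that
belong to other ranges, which is the overzap risk when walking every level.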