On Tue, Jul 06, 2021, Paolo Bonzini wrote:
> On 03/07/21 00:04, isaku.yamahata@xxxxxxxxx wrote:
> > From: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> > 
> > Zap only leaf SPTEs when deleting/moving a memslot by default, and add a
> > module param to allow reverting to the old behavior of zapping all SPTEs
> > at all levels and memslots when any memslot is updated.
> > 
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 21 ++++++++++++++++++++-
> >  1 file changed, 20 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 8d5876dfc6b7..5b8a640f8042 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -85,6 +85,9 @@ __MODULE_PARM_TYPE(nx_huge_pages_recovery_ratio, "uint");
> >  static bool __read_mostly force_flush_and_sync_on_reuse;
> >  module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);
> > +static bool __read_mostly memslot_update_zap_all;
> > +module_param(memslot_update_zap_all, bool, 0444);
> > +
> >  /*
> >   * When setting this variable to true it enables Two-Dimensional-Paging
> >   * where the hardware walks 2 page tables:
> > @@ -5480,11 +5483,27 @@ static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> >  	return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
> >  }
> > +static void kvm_mmu_zap_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
> > +{
> > +	/*
> > +	 * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
> > +	 * case scenario we'll have unused shadow pages lying around until they
> > +	 * are recycled due to age or when the VM is destroyed.
> > +	 */
> > +	write_lock(&kvm->mmu_lock);
> > +	slot_handle_level(kvm, slot, kvm_zap_rmapp, PG_LEVEL_4K,
> > +			  KVM_MAX_HUGEPAGE_LEVEL, true);
> > +	write_unlock(&kvm->mmu_lock);
> > +}
> > +
> >  static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
> > 			struct kvm_memory_slot *slot,
> > 			struct kvm_page_track_notifier_node *node)
> >  {
> > -	kvm_mmu_zap_all_fast(kvm);
> > +	if (memslot_update_zap_all)
> > +		kvm_mmu_zap_all_fast(kvm);
> > +	else
> > +		kvm_mmu_zap_memslot(kvm, slot);
> >  }
> >  void kvm_mmu_init_vm(struct kvm *kvm)
> > 
> This is the old patch that broke VFIO for some unknown reason.

Yes, my white whale :-/

> The commit message should at least say why memslot_update_zap_all is not true
> by default.  Also, IIUC the bug still there with NX hugepage splits disabled,

I strongly suspect the bug is also there with hugepage splits enabled, it's
just masked and/or harder to hit.

> but what if the TDP MMU is enabled?  This should not be a module param.

IIRC, the original code I wrote had it as a per-VM flag that wasn't even exposed
to the user, i.e. TDX guests always do the partial flush and non-TDX guests
always do the full flush.  I think that's the least awful approach if we can't
figure out the underlying bug before TDX is ready for inclusion.
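
Roughly something like the below (completely untested sketch, and the field name
"memslot_zap_leafs_only" plus where/how it gets set are made up purely for
illustration, not code from the actual series):

	/*
	 * arch/x86/include/asm/kvm_host.h: hypothetical per-VM flag, set once at
	 * VM creation (true for TDX guests, false for everyone else) and never
	 * exposed to userspace.
	 */
	struct kvm_arch {
		/* ... existing fields ... */
		bool memslot_zap_leafs_only;
	};

	/* arch/x86/kvm/mmu/mmu.c */
	static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
				struct kvm_memory_slot *slot,
				struct kvm_page_track_notifier_node *node)
	{
		/*
		 * Zap only the leaf SPTEs of the memslot being removed for VMs
		 * that opted in at creation time (TDX), otherwise keep the
		 * historical zap-everything behavior.
		 */
		if (kvm->arch.memslot_zap_leafs_only)
			kvm_mmu_zap_memslot(kvm, slot);
		else
			kvm_mmu_zap_all_fast(kvm);
	}

The important part is that the behavior is pinned per-VM at creation based on the
VM type, i.e. there's no admin-visible knob to get wrong and the non-TDX path is
completely unaffected.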