Tests of "normal VM + nested VM + 3 selftests" passed on the following 3 configs:
1) modprobe kvm_intel ept=0
2) modprobe kvm tdp_mmu=0
   modprobe kvm_intel ept=1
3) modprobe kvm tdp_mmu=1
   modprobe kvm_intel ept=1

Reviewed-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>
Tested-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>

On Wed, Oct 09, 2024 at 12:23:43PM -0700, Sean Christopherson wrote:
> When performing a targeted zap on memslot removal, zap only MMU pages that
> shadow guest PTEs, as zapping all SPs that "match" the gfn is inexact and
> unnecessary. Furthermore, for_each_gfn_valid_sp() arguably shouldn't
> exist, because it doesn't do what most people would expect it to do.
> The "round gfn for level" adjustment that is done for direct SPs (no gPTE)
> means that the exact gfn comparison will not get a match, even when a SP
> does "cover" a gfn, or was even created specifically for a gfn.
>
> For memslot deletion specifically, KVM's behavior will vary significantly
> based on the size and alignment of a memslot, and in weird ways. E.g. for
> a 4KiB memslot, KVM will zap more SPs if the slot is 1GiB aligned than if
> it's only 4KiB aligned. And as described below, zapping SPs in the
> aligned case overzaps for direct MMUs, as odds are good the upper-level
> SPs are serving other memslots.
>
> To iterate over all potentially-relevant gfns, KVM would need to make a
> pass over the hash table for each level, with the gfn used for lookup
> rounded for said level. And then check that the SP is of the correct
> level, too, e.g. to avoid over-zapping.
>
> But even then, KVM would massively overzap, as processing every level is
> all but guaranteed to zap SPs that serve other memslots, especially if the
> memslot being removed is relatively small. KVM could mitigate that issue
> by processing only levels that can be possible guest huge pages, i.e. are
> less likely to be re-used for other memslots, but while somewhat logical,
> that's quite arbitrary and would be a bit of a mess to implement.
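To make the alignment dependence described in the commit message concrete, here is a small standalone model (not KVM code) of the old exact "sp->gfn == gfn" check as applied to direct SPs. It assumes 4-level x86 paging with 512 entries per table, and that a direct SP's gfn is the first gfn of the range it maps (the "round gfn for level" adjustment). The names pages_covered_by_sp(), direct_sp_gfn() and slot_can_match_direct_sp() are hypothetical; hash buckets, roles and shadow (gPTE) SPs are deliberately ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/*
 * Hypothetical model: number of 4KiB pages mapped by a direct SP at @level,
 * assuming 512 entries per table (level 1 SP = 2MiB, level 2 SP = 1GiB).
 */
static gfn_t pages_covered_by_sp(int level)
{
	assert(level >= 1 && level <= 4);
	return (gfn_t)1 << (9 * level);
}

/* Model of "round gfn for level": align down to the span the SP maps. */
static gfn_t direct_sp_gfn(gfn_t gfn, int level)
{
	return gfn & ~(pages_covered_by_sp(level) - 1);
}

/*
 * Could an exact-gfn lookup, done for every gfn in [base, base + npages),
 * ever hit a direct SP of @level that covers part of the slot?  Only if
 * some gfn in the slot equals its own rounded value, i.e. is aligned to
 * the SP's span.
 */
static bool slot_can_match_direct_sp(gfn_t base, gfn_t npages, int level)
{
	for (gfn_t gfn = base; gfn < base + npages; gfn++) {
		if (direct_sp_gfn(gfn, level) == gfn)
			return true;
	}
	return false;
}
```

Under these assumptions, a one-page slot at gfn 0x40000 (1GiB aligned) can match direct SPs at levels 1 and 2, while a one-page slot at gfn 0x40001 matches neither, even though level 1 and level 2 SPs may cover it.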
>
> So, zap only SPs with gPTEs, as the resulting behavior is easy to describe,
> is predictable, and is explicitly minimal, i.e. KVM only zaps SPs that
> absolutely must be zapped.
>
> Cc: Yan Zhao <yan.y.zhao@xxxxxxxxx>
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
>  arch/x86/kvm/mmu/mmu.c | 16 ++++++----------
>  1 file changed, 6 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a9a23e058555..09494d01c38e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1884,14 +1884,10 @@ static bool sp_has_gptes(struct kvm_mmu_page *sp)
>  		if (is_obsolete_sp((_kvm), (_sp))) {			\
>  		} else
>
> -#define for_each_gfn_valid_sp(_kvm, _sp, _gfn)				\
> +#define for_each_gfn_valid_sp_with_gptes(_kvm, _sp, _gfn)		\
>  	for_each_valid_sp(_kvm, _sp,					\
>  	  &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)])	\
> -		if ((_sp)->gfn != (_gfn)) {} else
> -
> -#define for_each_gfn_valid_sp_with_gptes(_kvm, _sp, _gfn)		\
> -	for_each_gfn_valid_sp(_kvm, _sp, _gfn)				\
> -		if (!sp_has_gptes(_sp)) {} else
> +		if ((_sp)->gfn != (_gfn) || !sp_has_gptes(_sp)) {} else
>
>  static bool kvm_sync_page_check(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  {
> @@ -7063,15 +7059,15 @@ static void kvm_mmu_zap_memslot_pages_and_flush(struct kvm *kvm,
>
>  	/*
>  	 * Since accounting information is stored in struct kvm_arch_memory_slot,
> -	 * shadow pages deletion (e.g. unaccount_shadowed()) requires that all
> -	 * gfns with a shadow page have a corresponding memslot. Do so before
> -	 * the memslot goes away.
> +	 * all MMU pages that are shadowing guest PTEs must be zapped before the
> +	 * memslot is deleted, as freeing such pages after the memslot is freed
> +	 * will result in use-after-free, e.g. in unaccount_shadowed().
>  	 */
>  	for (i = 0; i < slot->npages; i++) {
>  		struct kvm_mmu_page *sp;
>  		gfn_t gfn = slot->base_gfn + i;
>
> -		for_each_gfn_valid_sp(kvm, sp, gfn)
> +		for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn)
>  			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
>
>  		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
> -- 
> 2.47.0.rc1.288.g06298d1525-goog
>
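On the macro change in the first hunk: the merged for_each_gfn_valid_sp_with_gptes() keeps the "if (cond) {} else" filter shape, so the caller's statement runs only for SPs that pass both the gfn and sp_has_gptes() checks. The standalone model below (not KVM code) illustrates that shape; struct model_sp, the model_* macros and model_count_zappable() are hypothetical stand-ins, with an array walk replacing the hash-bucket walk and plain booleans replacing sp_has_gptes() and is_obsolete_sp().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Hypothetical, heavily simplified stand-in for struct kvm_mmu_page. */
struct model_sp {
	gfn_t gfn;
	bool has_gptes;	/* models sp_has_gptes(): false for direct SPs */
	bool obsolete;	/* models is_obsolete_sp() */
};

/* Walk an array instead of a hash bucket, skipping obsolete SPs. */
#define model_for_each_valid_sp(_sp, _arr, _n)				\
	for ((_sp) = (_arr); (_sp) < (_arr) + (_n); (_sp)++)		\
		if ((_sp)->obsolete) {					\
		} else

/*
 * Same shape as the patched macro.  Because each filter is written as
 * "if (...) {} else", every inner if already owns an else, so an else
 * written by the caller after the loop body still binds to the caller's
 * own if (no dangling-else surprise).
 */
#define model_for_each_gfn_valid_sp_with_gptes(_sp, _arr, _n, _gfn)	\
	model_for_each_valid_sp(_sp, _arr, _n)				\
		if ((_sp)->gfn != (_gfn) || !(_sp)->has_gptes) {} else

/* Count the SPs the memslot-removal loop would zap for @gfn in this model. */
static size_t model_count_zappable(struct model_sp *arr, size_t n, gfn_t gfn)
{
	struct model_sp *sp;
	size_t count = 0;

	assert(arr != NULL || n == 0);
	model_for_each_gfn_valid_sp_with_gptes(sp, arr, n, gfn)
		count++;
	return count;
}
```

For example, given SPs at gfn 0x200 that are {shadow, direct, shadow but obsolete} plus a shadow SP at gfn 0x201, only one SP is counted for gfn 0x200: the direct SP is now skipped along with the obsolete one.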