Hi, Ben, On Mon, Nov 15, 2021 at 03:46:03PM -0800, Ben Gardon wrote: > When disabling dirty logging, the TDP MMU currently zaps each leaf entry > mapping memory in the relevant memslot. This is very slow. Doing the zaps > under the mmu read lock requires a TLB flush for every zap and the > zapping causes a storm of ETP/NPT violations. > > Instead of zapping, replace the split large pages with large page > mappings directly. While this sort of operation has historically only > been done in the vCPU page fault handler context, refactorings earlier > in this series and the relative simplicity of the TDP MMU make it > possible here as well. Thanks for this patch, it looks very useful. I've got a few comments below, but before that I've also got one off-topic question too; it'll be great if you can help answer. When I was looking into how the old code recovers the huge pages I found that we'll leave the full-zero pgtable page there until the next page fault, then I _think_ it'll be released only until the __handle_changed_spte() when we're dropping the old spte (handle_removed_tdp_mmu_page). As comment above handle_removed_tdp_mmu_page() showed, at this point IIUC current thread should have exclusive ownership of this orphaned and abandoned pgtable page, then why in handle_removed_tdp_mmu_page() we still need all the atomic operations and REMOVED_SPTE tricks to protect from concurrent access? Since that's cmpxchg-ed out of the old pgtable, what can be accessing it besides the current thread? > > Running the dirty_log_perf_test on an Intel Skylake with 96 vCPUs and 1G > of memory per vCPU, this reduces the time required to disable dirty > logging from over 45 seconds to just over 1 second. It also avoids > provoking page faults, improving vCPU performance while disabling > dirty logging. > > > Signed-off-by: Ben Gardon <bgardon@xxxxxxxxxx> > --- > arch/x86/kvm/mmu/mmu.c | 2 +- > arch/x86/kvm/mmu/mmu_internal.h | 4 ++ > arch/x86/kvm/mmu/tdp_mmu.c | 69 ++++++++++++++++++++++++++++++++- > 3 files changed, 72 insertions(+), 3 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index ef7a84422463..add724aa9e8c 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -4449,7 +4449,7 @@ static inline bool boot_cpu_is_amd(void) > * the direct page table on host, use as much mmu features as > * possible, however, kvm currently does not do execution-protection. > */ > -static void > +void > build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check, > int shadow_root_level) > { > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h > index 6563cce9c438..84d439432acf 100644 > --- a/arch/x86/kvm/mmu/mmu_internal.h > +++ b/arch/x86/kvm/mmu/mmu_internal.h > @@ -161,4 +161,8 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc); > void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp); > void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp); > > +void > +build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check, > + int shadow_root_level); > + > #endif /* __KVM_X86_MMU_INTERNAL_H */ > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index 43c7834b4f0a..b15c8cd11cf9 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -1361,6 +1361,66 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm, > clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot); > } > > +static void try_promote_lpage(struct kvm *kvm, > + const struct kvm_memory_slot *slot, > + struct tdp_iter *iter) > +{ > + struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep); > + struct rsvd_bits_validate shadow_zero_check; > + /* > + * Since the TDP MMU doesn't manage nested PTs, there's no need to > + * write protect for a nested VM when PML is in use. > + */ > + bool ad_need_write_protect = false; Shall we just pass in "false" in make_spte() and just move the comment there? > + bool map_writable; > + kvm_pfn_t pfn; > + u64 new_spte; > + u64 mt_mask; > + > + /* > + * If addresses are being invalidated, don't do in-place promotion to > + * avoid accidentally mapping an invalidated address. > + */ > + if (unlikely(kvm->mmu_notifier_count)) > + return; > + > + pfn = __gfn_to_pfn_memslot(slot, iter->gfn, true, NULL, true, > + &map_writable, NULL); Should we better check pfn validity and bail out otherwise? E.g. for atomic I think we can also get KVM_PFN_ERR_FAULT when fast-gup failed somehow. > + > + /* > + * Can't reconstitute an lpage if the consituent pages can't be > + * mapped higher. > + */ > + if (iter->level > kvm_mmu_max_mapping_level(kvm, slot, iter->gfn, > + pfn, PG_LEVEL_NUM)) > + return; > + > + build_tdp_shadow_zero_bits_mask(&shadow_zero_check, iter->root_level); > + > + /* > + * In some cases, a vCPU pointer is required to get the MT mask, > + * however in most cases it can be generated without one. If a > + * vCPU pointer is needed kvm_x86_try_get_mt_mask will fail. > + * In that case, bail on in-place promotion. > + */ > + if (unlikely(!static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn, > + kvm_is_mmio_pfn(pfn), > + &mt_mask))) > + return; > + > + make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true, > + map_writable, ad_need_write_protect, mt_mask, > + &shadow_zero_check, &new_spte); > + > + tdp_mmu_set_spte_atomic(kvm, iter, new_spte); > + > + /* > + * Re-read the SPTE to avoid recursing into one of the removed child > + * page tables. > + */ > + iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep)); Is this redundant since it seems the iterator logic handles this already, I'm reading try_step_down() here: /* * Reread the SPTE before stepping down to avoid traversing into page * tables that are no longer linked from this entry. */ iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep)); The rest looks good to me, thanks. > +} > + > /* > * Clear leaf entries which could be replaced by large mappings, for > * GFNs within the slot. > @@ -1381,9 +1441,14 @@ static void zap_collapsible_spte_range(struct kvm *kvm, > if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true)) > continue; > > - if (!is_shadow_present_pte(iter.old_spte) || > - !is_last_spte(iter.old_spte, iter.level)) > + if (!is_shadow_present_pte(iter.old_spte)) > + continue; > + > + /* Try to promote the constitutent pages to an lpage. */ > + if (!is_last_spte(iter.old_spte, iter.level)) { > + try_promote_lpage(kvm, slot, &iter); > continue; > + } > > pfn = spte_to_pfn(iter.old_spte); > if (kvm_is_reserved_pfn(pfn) || > -- > 2.34.0.rc1.387.gb447b232ab-goog > -- Peter Xu