Optimize TDP MMU's .change_pte() handler to prefetch SPTEs directly in the
handler, using the PFN info carried by .change_pte(), so that a vCPU write
that triggers .change_pte() no longer has to go through two VMExits and two
TDP page faults.

When there's a running vCPU on the current pCPU, .change_pte() is most
likely caused by a vCPU write to a guest page previously faulted in by a
vCPU read. The detailed sequence is as follows:

1. A vCPU reads a guest page. Though the page is in a RW memslot, both the
   primary MMU and KVM's secondary MMU map it with read-only PTEs during
   the page fault.
2. The vCPU writes to this guest page.
3. A VMExit occurs and kvm_tdp_mmu_page_fault() calls GUP, which triggers
   COW, so .invalidate_range_start(), .change_pte() and
   .invalidate_range_end() are called successively.
4. kvm_tdp_mmu_page_fault() returns retry, because it always finds the
   current page fault stale due to the mmu_invalidate_seq bumped in
   .invalidate_range_end().
5. The vCPU takes another VMExit and TDP page fault.
6. The writable SPTE is mapped successfully.

That is, each guest write to a COW page triggers a VMExit and a KVM TDP
page fault twice, even though .change_pte() has already notified KVM of
the new PTE to be mapped.

Since .change_pte() is called at a point where the change is guaranteed to
succeed in the primary MMU, prefetching the new PFN directly in the
.change_pte() handler of the secondary MMU (KVM MMU) saves KVM the second
VMExit and TDP page fault.

In tests on my environment with 8 vCPUs, 16G memory and no assigned
devices, around 8000+ (with OVMF) and 17000+ (with Seabios) TDP page
faults are saved during each VM boot-up, and around 44000+ TDP page faults
are saved while booting an L2 VM with 2G memory.

Signed-off-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>
---
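Not part of the patch, just an illustration for reviewers: a minimal
userspace sketch of the read-then-write pattern from steps 1-2 above, on a
private anonymous mapping. The read fault installs a read-only PTE in the
primary MMU (the shared zero page); the subsequent write forces COW, which
is where .change_pte() is raised towards secondary MMUs such as KVM when
such memory backs guest RAM. In the KVM case the write side is driven by
GUP from the fault handler rather than a direct CPU store, but the COW
taken by the primary MMU is the same. The file name and the value written
below are arbitrary.

/* cow_demo.c - read then write one anonymous page to force COW */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	unsigned char *buf;

	buf = mmap(NULL, psz, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return EXIT_FAILURE;
	}

	/* Read fault: the page is mapped read-only (zero page). */
	unsigned char first = buf[0];

	/* Write fault: COW breaks the read-only mapping. */
	buf[0] = 0xaa;

	printf("first read: %d, then wrote 0x%x\n", first, (unsigned)buf[0]);
	munmap(buf, psz);
	return 0;
}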
 arch/x86/kvm/mmu/tdp_mmu.c | 69 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 68 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 89a1f222e823..672a1e333c92 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1243,10 +1243,77 @@ bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
  */
 bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
+	struct kvm_mmu_page *root;
+	struct kvm_mmu_page *sp;
+	bool wrprot, writable;
+	struct kvm_vcpu *vcpu;
+	struct tdp_iter iter;
+	bool flush = false;
+	kvm_pfn_t pfn;
+	u64 new_spte;
+
 	/* Huge pages aren't expected to be modified */
 	WARN_ON(pte_huge(range->arg.pte) || range->start + 1 != range->end);
 
-	return false;
+	/*
+	 * Get the current running vCPU to be used in the prefetch in
+	 * make_spte() below. If there is no running vCPU, .change_pte() is
+	 * probably not triggered by vCPU writes, so drop prefetching SPTEs
+	 * in that case. Also only prefetch for L1 vCPUs.
+	 * If the vCPU is scheduled out later, it's still all right to
+	 * prefetch with the same vCPU, except that the prefetched SPTE may
+	 * not be accessed immediately.
+	 */
+	vcpu = kvm_get_running_vcpu();
+	if (!vcpu || vcpu->kvm != kvm || is_guest_mode(vcpu))
+		return flush;
+
+	writable = !(range->slot->flags & KVM_MEM_READONLY) && pte_write(range->arg.pte);
+	pfn = pte_pfn(range->arg.pte);
+
+	/* Do not allow rescheduling, just as in kvm_tdp_mmu_handle_gfn() */
+	for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
+		rcu_read_lock();
+
+		tdp_root_for_each_pte(iter, root, range->start, range->end) {
+			if (iter.level > PG_LEVEL_4K)
+				continue;
+
+			sp = sptep_to_sp(rcu_dereference(iter.sptep));
+
+			/* make the SPTE as prefetch */
+			wrprot = make_spte(vcpu, sp, range->slot, ACC_ALL, iter.gfn,
+					   pfn, iter.old_spte, true, true, writable,
+					   &new_spte);
+			/*
+			 * Do not prefetch the new PFN for a page-tracked GFN,
+			 * as we want the page fault handler to be triggered later.
+			 */
+			if (wrprot)
+				continue;
+
+			/*
+			 * Warn if an existing SPTE is found because it must not happen:
+			 * .change_pte() must be surrounded by .invalidate_range_{start,end}(),
+			 * so (1) kvm_unmap_gfn_range() should have zapped the old SPTE,
+			 * (2) the page fault handler should not be able to install a new SPTE
+			 * until .invalidate_range_end() completes.
+			 *
+			 * Even if the warn is hit and flush is true
+			 * (which indicates a bug in the mmu notifier handler),
+			 * there's no need to handle the remote TLB flush under RCU protection:
+			 * the target SPTE _must_ be a leaf SPTE, i.e. it cannot result in
+			 * freeing a shadow page.
+			 */
+			flush = WARN_ON(is_shadow_present_pte(iter.old_spte));
+			tdp_mmu_iter_set_spte(kvm, &iter, new_spte);
+
+		}
+
+		rcu_read_unlock();
+	}
+
+	return flush;
 }
 
 /*
-- 
2.17.1