On Fri, Sep 13, 2024, Yan Zhao wrote: > This is a lock status report of TDX module for current SEAMCALL retry issue > based on code in TDX module public repo https://github.com/intel/tdx-module.git > branch TDX_1.5.05. > > TL;DR: > - tdh_mem_track() can contend with tdh_vp_enter(). > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected. The zero-step logic seems to be the most problematic. E.g. if KVM is trying to install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for whatever reason. Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path, what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still hits the fault? For non-TDX, resuming the guest and letting the vCPU retry the instruction is desirable because in many cases, the winning task will install a valid mapping before KVM can re-run the vCPU, i.e. the fault will be fixed before the instruction is re-executed. In the happy case, that provides optimal performance as KVM doesn't introduce any extra delay/latency. But for TDX, the math is different as the cost of a re-hitting a fault is much, much higher, especially in light of the zero-step issues. E.g. if the TDP MMU returns a unique error code for the frozen case, and kvm_mmu_page_fault() is modified to return the raw return code instead of '1', then the TDX EPT violation path can safely retry locally, similar to the do-while loop in kvm_tdp_map_page(). The only part I don't like about this idea is having two "retry" return values, which creates the potential for bugs due to checking one but not the other. Hmm, that could be avoided by passing a bool pointer as an out-param to communicate to the TDX S-EPT fault handler that the SPTE is frozen. I think I like that option better even though the out-param is a bit gross, because it makes it more obvious that the "frozen_spte" is a special case that doesn't need attention for most paths. > - tdg_mem_page_accept() can contend with other tdh_mem*(). > > Proposal: > - Return -EAGAIN directly in ops link_external_spt/set_external_spte when > tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY. What is the result of returning -EAGAIN? E.g. does KVM redo tdh_vp_enter()? Also tdh_mem_sept_add() is strictly pre-finalize, correct? I.e. should never contend with tdg_mem_page_accept() because vCPUs can't yet be run. Similarly, can tdh_mem_page_aug() actually contend with tdg_mem_page_accept()? The page isn't yet mapped, so why would the guest be allowed to take a lock on the S-EPT entry? > - Kick off vCPUs at the beginning of page removal path, i.e. before the > tdh_mem_range_block(). > Set a flag and disallow tdh_vp_enter() until tdh_mem_page_remove() is done. This is easy enough to do via a request, e.g. see KVM_REQ_MCLOCK_INPROGRESS. > (one possible optimization: > since contention from tdh_vp_enter()/tdg_mem_page_accept should be rare, > do not kick off vCPUs in normal conditions. > When SEAMCALL BUSY happens, retry for once, kick off vCPUs and do not allow Which SEAMCALL is this specifically? tdh_mem_range_block()? > TD enter until page removal completes.) Idea #1: --- diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index b45258285c9c..8113c17bd2f6 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -4719,7 +4719,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, return -EINTR; cond_resched(); r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level); - } while (r == RET_PF_RETRY); + } while (r == RET_PF_RETRY || r == RET_PF_RETRY_FOZEN); if (r < 0) return r; @@ -6129,7 +6129,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err vcpu->stat.pf_spurious++; if (r != RET_PF_EMULATE) - return 1; + return r; emulate: return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn, diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index 8d3fb3c8c213..690f03d7daae 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -256,12 +256,15 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); * and of course kvm_mmu_do_page_fault(). * * RET_PF_CONTINUE: So far, so good, keep handling the page fault. + * RET_PF_FIXED: The faulting entry has been fixed. * RET_PF_RETRY: let CPU fault again on the address. + * RET_PF_RETRY_FROZEN: One or more SPTEs related to the address is frozen. + * Let the CPU fault again on the address, or retry the + * fault "locally", i.e. without re-entering the guest. * RET_PF_EMULATE: mmio page fault, emulate the instruction directly. * RET_PF_WRITE_PROTECTED: the gfn is write-protected, either unprotected the * gfn and retry, or emulate the instruction directly. * RET_PF_INVALID: the spte is invalid, let the real page fault path update it. - * RET_PF_FIXED: The faulting entry has been fixed. * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU. * * Any names added to this enum should be exported to userspace for use in @@ -271,14 +274,18 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); * on -errno return values. Somewhat arbitrarily use '0' for CONTINUE, which * will allow for efficient machine code when checking for CONTINUE, e.g. * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero. + * + * Note #2, RET_PF_FIXED _must_ be '1', so that KVM's -errno/0/1 return code + * scheme, where 1==success, translates '1' to RET_PF_FIXED. */ enum { RET_PF_CONTINUE = 0, + RET_PF_FIXED = 1, RET_PF_RETRY, + RET_PF_RETRY_FROZEN, RET_PF_EMULATE, RET_PF_WRITE_PROTECTED, RET_PF_INVALID, - RET_PF_FIXED, RET_PF_SPURIOUS, }; diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 5a475a6456d4..cbf9e46203f3 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1174,6 +1174,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) retry: rcu_read_unlock(); + if (ret == RET_PF_RETRY && is_frozen_spte(iter.old_spte)) + return RET_PF_RETRY_FOZEN; return ret; } --- Idea #2: --- arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/mmu/mmu.c | 12 ++++++------ arch/x86/kvm/mmu/mmu_internal.h | 15 ++++++++++++--- arch/x86/kvm/mmu/tdp_mmu.c | 1 + arch/x86/kvm/svm/svm.c | 2 +- arch/x86/kvm/vmx/vmx.c | 4 ++-- 6 files changed, 23 insertions(+), 13 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 46e0a466d7fb..200fecd1de88 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -2183,7 +2183,7 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr, int kvm_emulate_hypercall(struct kvm_vcpu *vcpu); int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code, - void *insn, int insn_len); + void *insn, int insn_len, bool *frozen_spte); void kvm_mmu_print_sptes(struct kvm_vcpu *vcpu, gpa_t gpa, const char *msg); void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva); void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index b45258285c9c..207840a316d3 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -4283,7 +4283,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work) return; r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, work->arch.error_code, - true, NULL, NULL); + true, NULL, NULL, NULL); /* * Account fixed page faults, otherwise they'll never be counted, but @@ -4627,7 +4627,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code, trace_kvm_page_fault(vcpu, fault_address, error_code); r = kvm_mmu_page_fault(vcpu, fault_address, error_code, insn, - insn_len); + insn_len, NULL); } else if (flags & KVM_PV_REASON_PAGE_NOT_PRESENT) { vcpu->arch.apf.host_apf_flags = 0; local_irq_disable(); @@ -4718,7 +4718,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, if (signal_pending(current)) return -EINTR; cond_resched(); - r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level); + r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level, NULL); } while (r == RET_PF_RETRY); if (r < 0) @@ -6073,7 +6073,7 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, } int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code, - void *insn, int insn_len) + void *insn, int insn_len, bool *frozen_spte) { int r, emulation_type = EMULTYPE_PF; bool direct = vcpu->arch.mmu->root_role.direct; @@ -6109,7 +6109,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err vcpu->stat.pf_taken++; r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, error_code, false, - &emulation_type, NULL); + &emulation_type, NULL, frozen_spte); if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm)) return -EIO; } @@ -6129,7 +6129,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err vcpu->stat.pf_spurious++; if (r != RET_PF_EMULATE) - return 1; + return r; emulate: return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn, diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index 8d3fb3c8c213..5b1fc77695c1 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -247,6 +247,9 @@ struct kvm_page_fault { * is changing its own translation in the guest page tables. */ bool write_fault_to_shadow_pgtable; + + /* Indicates the page fault needs to be retried due to a frozen SPTE. */ + bool frozen_spte; }; int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); @@ -256,12 +259,12 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); * and of course kvm_mmu_do_page_fault(). * * RET_PF_CONTINUE: So far, so good, keep handling the page fault. + * RET_PF_FIXED: The faulting entry has been fixed. * RET_PF_RETRY: let CPU fault again on the address. * RET_PF_EMULATE: mmio page fault, emulate the instruction directly. * RET_PF_WRITE_PROTECTED: the gfn is write-protected, either unprotected the * gfn and retry, or emulate the instruction directly. * RET_PF_INVALID: the spte is invalid, let the real page fault path update it. - * RET_PF_FIXED: The faulting entry has been fixed. * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU. * * Any names added to this enum should be exported to userspace for use in @@ -271,14 +274,17 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); * on -errno return values. Somewhat arbitrarily use '0' for CONTINUE, which * will allow for efficient machine code when checking for CONTINUE, e.g. * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero. + * + * Note #2, RET_PF_FIXED _must_ be '1', so that KVM's -errno/0/1 return code + * scheme, where 1==success, translates '1' to RET_PF_FIXED. */ enum { RET_PF_CONTINUE = 0, + RET_PF_FIXED = 1, RET_PF_RETRY, RET_PF_EMULATE, RET_PF_WRITE_PROTECTED, RET_PF_INVALID, - RET_PF_FIXED, RET_PF_SPURIOUS, }; @@ -292,7 +298,8 @@ static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu, static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err, bool prefetch, - int *emulation_type, u8 *level) + int *emulation_type, u8 *level, + bool *frozen_spte) { struct kvm_page_fault fault = { .addr = cr2_or_gpa, @@ -341,6 +348,8 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, *emulation_type |= EMULTYPE_WRITE_PF_TO_SP; if (level) *level = fault.goal_level; + if (frozen_spte) + *frozen_spte = fault.frozen_spte; return r; } diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 5a475a6456d4..e7fc5ea4b437 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1174,6 +1174,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) retry: rcu_read_unlock(); + fault->frozen_spte = is_frozen_spte(iter.old_spte); return ret; } diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 38723b0c435d..269de6a9eb13 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -2075,7 +2075,7 @@ static int npf_interception(struct kvm_vcpu *vcpu) rc = kvm_mmu_page_fault(vcpu, fault_address, error_code, static_cpu_has(X86_FEATURE_DECODEASSISTS) ? svm->vmcb->control.insn_bytes : NULL, - svm->vmcb->control.insn_len); + svm->vmcb->control.insn_len, NULL); if (rc > 0 && error_code & PFERR_GUEST_RMP_MASK) sev_handle_rmp_fault(vcpu, fault_address, error_code); diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 368acfebd476..fc2ff5d91a71 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -5822,7 +5822,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu) if (unlikely(allow_smaller_maxphyaddr && !kvm_vcpu_is_legal_gpa(vcpu, gpa))) return kvm_emulate_instruction(vcpu, 0); - return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0); + return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0, NULL); } static int handle_ept_misconfig(struct kvm_vcpu *vcpu) @@ -5843,7 +5843,7 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu) return kvm_skip_emulated_instruction(vcpu); } - return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0); + return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0, NULL); } static int handle_nmi_window(struct kvm_vcpu *vcpu) base-commit: bc87a2b4b5508d247ed2c30cd2829969d168adfe --