On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote:
> On Fri, Sep 13, 2024, Yan Zhao wrote:
> > This is a lock status report of TDX module for current SEAMCALL retry issue
> > based on code in TDX module public repo https://github.com/intel/tdx-module.git
> > branch TDX_1.5.05.
> >
> > TL;DR:
> > - tdh_mem_track() can contend with tdh_vp_enter().
> > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
>
> The zero-step logic seems to be the most problematic.  E.g. if KVM is trying to
> install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
> a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> trigger the zero-step mitigation if the vCPU that "wins" gets delayed for
> whatever reason.
>
> Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
> what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
> the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
> hits the fault?
>
> For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> desirable because in many cases, the winning task will install a valid mapping
> before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> instruction is re-executed.  In the happy case, that provides optimal performance
> as KVM doesn't introduce any extra delay/latency.
>
> But for TDX, the math is different as the cost of re-hitting a fault is much,
> much higher, especially in light of the zero-step issues.
>
> E.g. if the TDP MMU returns a unique error code for the frozen case, and
> kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> then the TDX EPT violation path can safely retry locally, similar to the do-while
> loop in kvm_tdp_map_page().
>
> The only part I don't like about this idea is having two "retry" return values,
> which creates the potential for bugs due to checking one but not the other.
>
> Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
> to the TDX S-EPT fault handler that the SPTE is frozen.  I think I like that
> option better even though the out-param is a bit gross, because it makes it more
> obvious that the "frozen_spte" is a special case that doesn't need attention for
> most paths.
Good idea.
But could we extend it a bit more to allow TDX's EPT violation handler to also
retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY?

> > - tdg_mem_page_accept() can contend with other tdh_mem*().
> >
> > Proposal:
> > - Return -EAGAIN directly in ops link_external_spt/set_external_spte when
> >   tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY.
>
> What is the result of returning -EAGAIN?  E.g. does KVM redo tdh_vp_enter()?
Sorry, I meant -EBUSY originally.

With the current code in kvm_tdp_map_page(), the vCPU would just retry without
tdh_vp_enter(), except when there are signals pending. With a real EPT
violation, tdh_vp_enter() would be called again.

I realized that this is not good enough.
So, is it better to return -EAGAIN in ops link_external_spt/set_external_spte
and have kvm_tdp_mmu_map() return RET_PF_RETRY_FROZEN for -EAGAIN?
(or maybe some other name for RET_PF_RETRY_FROZEN).
A rough sketch of what I mean on the ops side is below.

> Also tdh_mem_sept_add() is strictly pre-finalize, correct?  I.e. should never
> contend with tdg_mem_page_accept() because vCPUs can't yet be run.
tdh_mem_page_add() is pre-finalize, tdh_mem_sept_add() is not.
tdh_mem_sept_add() can be called at runtime by tdp_mmu_link_sp().
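Coming back to the ops question, what I have in mind on the ops side is roughly
the sketch below. This is not code from the series: tdx_sept_set_private_spte(),
the tdh_mem_page_aug() arguments and the tdx_operand_busy() check are only
placeholders, meant to show where the -EAGAIN would come from.

/*
 * Sketch only.  Translate a BUSY status from the SEAMCALL into -EAGAIN so
 * that the fault handler can retry locally instead of looping on the
 * SEAMCALL or resuming the guest.
 */
static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
				     enum pg_level level, kvm_pfn_t pfn)
{
	u64 err;

	err = tdh_mem_page_aug(/* kvm_tdx, gpa, level, pfn */);
	if (tdx_operand_busy(err))	/* placeholder for the BUSY check */
		return -EAGAIN;
	if (KVM_BUG_ON(err, kvm))
		return -EIO;

	return 0;
}

kvm_tdp_mmu_map() would then turn the -EAGAIN from set_external_spte into
RET_PF_RETRY_FROZEN (or whatever name we pick), the same way the frozen-SPTE
case is handled in Idea #1 below.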
> Similarly, can tdh_mem_page_aug() actually contend with tdg_mem_page_accept()?
> The page isn't yet mapped, so why would the guest be allowed to take a lock on
> the S-EPT entry?
Before tdg_mem_page_accept() accepts a GPA and sets the RWX bits in the S-EPT
entry, if a second tdh_mem_page_aug() is called on the same GPA, that second
call may contend with tdg_mem_page_accept(). But given that KVM does not allow
a second tdh_mem_page_aug(), it looks like contention between
tdh_mem_page_aug() and tdg_mem_page_accept() will not happen.

> > - Kick off vCPUs at the beginning of page removal path, i.e. before the
> >   tdh_mem_range_block().
> >   Set a flag and disallow tdh_vp_enter() until tdh_mem_page_remove() is done.
>
> This is easy enough to do via a request, e.g. see KVM_REQ_MCLOCK_INPROGRESS.
Great! (A rough sketch of this is below, after the quoted proposal.)

> > (one possible optimization:
> >  since contention from tdh_vp_enter()/tdg_mem_page_accept() should be rare,
> >  do not kick off vCPUs in normal conditions.
> >  When SEAMCALL BUSY happens, retry for once, kick off vCPUs and do not allow
>
> Which SEAMCALL is this specifically?  tdh_mem_range_block()?
Yes. They are:
- tdh_mem_range_block(), which contends with tdh_vp_enter() for secure_ept_lock.
- tdh_mem_track(), which contends with tdh_vp_enter() for the TD epoch.
  (the current code in MMU part 2 just retries tdh_mem_track() endlessly),
- tdh_mem_page_remove()/tdh_mem_range_block(), which contend with
  tdg_mem_page_accept() for the SEPT entry lock.
  (this one should not happen on a sane guest).

     Resources            SHARED users            EXCLUSIVE users
------------------------------------------------------------------------
(5)  TDCS epoch           tdh_vp_enter            tdh_mem_track
------------------------------------------------------------------------
(6)  secure_ept_lock      tdh_mem_sept_add        tdh_vp_enter
                          tdh_mem_page_aug        tdh_mem_sept_remove
                          tdh_mem_page_remove     tdh_mem_range_block
                                                  tdh_mem_range_unblock
------------------------------------------------------------------------
(7)  SEPT entry                                   tdh_mem_sept_add
                                                  tdh_mem_sept_remove
                                                  tdh_mem_page_aug
                                                  tdh_mem_page_remove
                                                  tdh_mem_range_block
                                                  tdh_mem_range_unblock
                                                  tdg_mem_page_accept

> > TD enter until page removal completes.)
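For the kick-off idea, what I'm thinking of is roughly the sketch below (not
code from the series; KVM_REQ_TDX_SEPT_REMOVAL is a made-up request name,
modeled on how KVM_REQ_MCLOCK_INPROGRESS keeps vCPUs out of the guest during a
masterclock update, and the SEAMCALL wrapper arguments are elided):

/*
 * Sketch only.  A request with no handler keeps vCPUs from re-entering the
 * guest until it is cleared, so tdh_vp_enter() cannot contend with the
 * removal SEAMCALLs below.
 */
static void tdx_remove_private_page(struct kvm *kvm, gfn_t gfn, int level)
{
	struct kvm_vcpu *vcpu;
	unsigned long i;

	/* Kick all vCPUs out of the TD and block further TD enters. */
	kvm_make_all_cpus_request(kvm, KVM_REQ_TDX_SEPT_REMOVAL);

	/*
	 * No vCPU can reach tdh_vp_enter() now, so tdh_mem_range_block()
	 * cannot lose secure_ept_lock to it and tdh_mem_track() cannot
	 * contend on the TD epoch.
	 */
	tdh_mem_range_block(/* kvm_tdx, gpa, level */);
	tdh_mem_track(/* kvm_tdx */);
	tdh_mem_page_remove(/* kvm_tdx, gpa, level */);

	/* Allow TD enter again. */
	kvm_for_each_vcpu(i, vcpu, kvm)
		kvm_clear_request(KVM_REQ_TDX_SEPT_REMOVAL, vcpu);
}

The optimization quoted above would just defer the kvm_make_all_cpus_request()
until a first SEAMCALL retry has already hit BUSY.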
>
> Idea #1:
> ---
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b45258285c9c..8113c17bd2f6 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4719,7 +4719,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
>  			return -EINTR;
>  		cond_resched();
>  		r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
> -	} while (r == RET_PF_RETRY);
> +	} while (r == RET_PF_RETRY || r == RET_PF_RETRY_FROZEN);
>
>  	if (r < 0)
>  		return r;
> @@ -6129,7 +6129,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
>  		vcpu->stat.pf_spurious++;
>
>  	if (r != RET_PF_EMULATE)
> -		return 1;
> +		return r;
>
>  emulate:
>  	return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn,
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 8d3fb3c8c213..690f03d7daae 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -256,12 +256,15 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>   * and of course kvm_mmu_do_page_fault().
>   *
>   * RET_PF_CONTINUE: So far, so good, keep handling the page fault.
> + * RET_PF_FIXED: The faulting entry has been fixed.
>   * RET_PF_RETRY: let CPU fault again on the address.
> + * RET_PF_RETRY_FROZEN: One or more SPTEs related to the address is frozen.
> + *                      Let the CPU fault again on the address, or retry the
> + *                      fault "locally", i.e. without re-entering the guest.
>   * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
>   * RET_PF_WRITE_PROTECTED: the gfn is write-protected, either unprotected the
>   *                         gfn and retry, or emulate the instruction directly.
>   * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> - * RET_PF_FIXED: The faulting entry has been fixed.
>   * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
>   *
>   * Any names added to this enum should be exported to userspace for use in
> @@ -271,14 +274,18 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>   * on -errno return values.  Somewhat arbitrarily use '0' for CONTINUE, which
>   * will allow for efficient machine code when checking for CONTINUE, e.g.
>   * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero.
> + *
> + * Note #2, RET_PF_FIXED _must_ be '1', so that KVM's -errno/0/1 return code
> + * scheme, where 1==success, translates '1' to RET_PF_FIXED.
>   */
It looks like "r > 0" represents success in vcpu_run(), so moving RET_PF_FIXED
to 1 is not necessary?

>  enum {
>  	RET_PF_CONTINUE = 0,
> +	RET_PF_FIXED = 1,
>  	RET_PF_RETRY,
> +	RET_PF_RETRY_FROZEN,
>  	RET_PF_EMULATE,
>  	RET_PF_WRITE_PROTECTED,
>  	RET_PF_INVALID,
> -	RET_PF_FIXED,
>  	RET_PF_SPURIOUS,
>  };
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 5a475a6456d4..cbf9e46203f3 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1174,6 +1174,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>
>  retry:
>  	rcu_read_unlock();
> +	if (ret == RET_PF_RETRY && is_frozen_spte(iter.old_spte))
> +		return RET_PF_RETRY_FROZEN;
>  	return ret;
>  }
>
> ---
>
>
> Idea #2:
> ---
>  arch/x86/include/asm/kvm_host.h |  2 +-
>  arch/x86/kvm/mmu/mmu.c          | 12 ++++++------
>  arch/x86/kvm/mmu/mmu_internal.h | 15 ++++++++++++---
>  arch/x86/kvm/mmu/tdp_mmu.c      |  1 +
>  arch/x86/kvm/svm/svm.c          |  2 +-
>  arch/x86/kvm/vmx/vmx.c          |  4 ++--
>  6 files changed, 23 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 46e0a466d7fb..200fecd1de88 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2183,7 +2183,7 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>  int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
>
>  int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
> -		       void *insn, int insn_len);
> +		       void *insn, int insn_len, bool *frozen_spte);
>  void kvm_mmu_print_sptes(struct kvm_vcpu *vcpu, gpa_t gpa, const char *msg);
>  void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva);
>  void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b45258285c9c..207840a316d3 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4283,7 +4283,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>  		return;
>
>  	r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, work->arch.error_code,
> -				  true, NULL, NULL);
> +				  true, NULL, NULL, NULL);
>
>  	/*
>  	 * Account fixed page faults, otherwise they'll never be counted, but
> @@ -4627,7 +4627,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>  		trace_kvm_page_fault(vcpu, fault_address, error_code);
>
>  		r = kvm_mmu_page_fault(vcpu, fault_address, error_code, insn,
> -				       insn_len);
> +				       insn_len, NULL);
>  	} else if (flags & KVM_PV_REASON_PAGE_NOT_PRESENT) {
>  		vcpu->arch.apf.host_apf_flags = 0;
>  		local_irq_disable();
> @@ -4718,7 +4718,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
>  		if (signal_pending(current))
>  			return -EINTR;
>  		cond_resched();
> -		r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
> +		r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level, NULL);
>  	} while (r == RET_PF_RETRY);
>
>  	if (r < 0)
> @@ -6073,7 +6073,7 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  }
>
>  int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
> -		       void *insn, int insn_len)
> +		       void *insn, int insn_len, bool *frozen_spte)
>  {
>  	int r, emulation_type = EMULTYPE_PF;
>  	bool direct = vcpu->arch.mmu->root_role.direct;
> @@ -6109,7 +6109,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
>  		vcpu->stat.pf_taken++;
>
>  		r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, error_code, false,
> -					  &emulation_type, NULL);
> +					  &emulation_type, NULL, frozen_spte);
>  		if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
>  			return -EIO;
>  	}
> @@ -6129,7 +6129,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
>  		vcpu->stat.pf_spurious++;
>
>  	if (r != RET_PF_EMULATE)
> -		return 1;
> +		return r;
>
>  emulate:
>  	return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn,
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 8d3fb3c8c213..5b1fc77695c1 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -247,6 +247,9 @@ struct kvm_page_fault {
>  	 * is changing its own translation in the guest page tables.
>  	 */
>  	bool write_fault_to_shadow_pgtable;
> +
> +	/* Indicates the page fault needs to be retried due to a frozen SPTE. */
> +	bool frozen_spte;
>  };
>
>  int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> @@ -256,12 +259,12 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>   * and of course kvm_mmu_do_page_fault().
>   *
>   * RET_PF_CONTINUE: So far, so good, keep handling the page fault.
> + * RET_PF_FIXED: The faulting entry has been fixed.
>   * RET_PF_RETRY: let CPU fault again on the address.
>   * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
>   * RET_PF_WRITE_PROTECTED: the gfn is write-protected, either unprotected the
>   *                         gfn and retry, or emulate the instruction directly.
>   * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> - * RET_PF_FIXED: The faulting entry has been fixed.
>   * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
>   *
>   * Any names added to this enum should be exported to userspace for use in
> @@ -271,14 +274,17 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>   * on -errno return values.  Somewhat arbitrarily use '0' for CONTINUE, which
>   * will allow for efficient machine code when checking for CONTINUE, e.g.
>   * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero.
> + *
> + * Note #2, RET_PF_FIXED _must_ be '1', so that KVM's -errno/0/1 return code
> + * scheme, where 1==success, translates '1' to RET_PF_FIXED.
>   */
>  enum {
>  	RET_PF_CONTINUE = 0,
> +	RET_PF_FIXED = 1,
>  	RET_PF_RETRY,
>  	RET_PF_EMULATE,
>  	RET_PF_WRITE_PROTECTED,
>  	RET_PF_INVALID,
> -	RET_PF_FIXED,
>  	RET_PF_SPURIOUS,
>  };
>
> @@ -292,7 +298,8 @@ static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
>
>  static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  					u64 err, bool prefetch,
> -					int *emulation_type, u8 *level)
> +					int *emulation_type, u8 *level,
> +					bool *frozen_spte)
>  {
>  	struct kvm_page_fault fault = {
>  		.addr = cr2_or_gpa,
> @@ -341,6 +348,8 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  		*emulation_type |= EMULTYPE_WRITE_PF_TO_SP;
>  	if (level)
>  		*level = fault.goal_level;
> +	if (frozen_spte)
> +		*frozen_spte = fault.frozen_spte;
>
>  	return r;
>  }
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 5a475a6456d4..e7fc5ea4b437 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1174,6 +1174,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>
>  retry:
>  	rcu_read_unlock();
> +	fault->frozen_spte = is_frozen_spte(iter.old_spte);
>  	return ret;
>  }
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 38723b0c435d..269de6a9eb13 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -2075,7 +2075,7 @@ static int npf_interception(struct kvm_vcpu *vcpu)
>  	rc = kvm_mmu_page_fault(vcpu, fault_address, error_code,
>  				static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
>  				svm->vmcb->control.insn_bytes : NULL,
> -				svm->vmcb->control.insn_len);
> +				svm->vmcb->control.insn_len, NULL);
>
>  	if (rc > 0 && error_code & PFERR_GUEST_RMP_MASK)
>  		sev_handle_rmp_fault(vcpu, fault_address, error_code);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 368acfebd476..fc2ff5d91a71 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5822,7 +5822,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>  	if (unlikely(allow_smaller_maxphyaddr && !kvm_vcpu_is_legal_gpa(vcpu, gpa)))
>  		return kvm_emulate_instruction(vcpu, 0);
>
> -	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> +	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0, NULL);
>  }
>
>  static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
> @@ -5843,7 +5843,7 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
>  		return kvm_skip_emulated_instruction(vcpu);
>  	}
>
> -	return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0);
> +	return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0, NULL);
>  }
>
>  static int handle_nmi_window(struct kvm_vcpu *vcpu)
>
> base-commit: bc87a2b4b5508d247ed2c30cd2829969d168adfe
> --
>
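For completeness, my understanding is that with Idea #2 the TDX EPT violation
handler would end up with a local retry loop roughly like the sketch below.
This is not code from the series; tdx_handle_ept_violation() and the
gpa/error_code accessors are placeholders, and it assumes kvm_mmu_page_fault()
returns the raw RET_PF_* value as in the diff above.

/*
 * Sketch only.  Retry "locally" while the fault keeps failing on a frozen
 * SPTE, instead of re-entering the guest and risking the zero-step
 * mitigation; everything else goes through the normal exit handling.
 * A real version would likely also need cond_resched()/signal handling,
 * like the do-while loop in kvm_tdp_map_page().
 */
static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
{
	gpa_t gpa = tdx_exit_gpa(vcpu);				/* placeholder */
	u64 error_code = tdx_ept_violation_error_code(vcpu);	/* placeholder */
	bool frozen_spte;
	int r;

	do {
		frozen_spte = false;
		r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0,
				       &frozen_spte);
	} while (r == RET_PF_RETRY && frozen_spte);

	return r;
}

With the -EBUSY/-EAGAIN extension from the beginning of this mail, a busy
tdh_mem_sept_add()/tdh_mem_page_aug() could set the same out-param (or a second
one), so the same loop would cover that case too.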