Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT

On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote:
> On Fri, Sep 13, 2024, Yan Zhao wrote:
> > This is a lock status report of TDX module for current SEAMCALL retry issue
> > based on code in TDX module public repo https://github.com/intel/tdx-module.git
> > branch TDX_1.5.05.
> > 
> > TL;DR:
> > - tdh_mem_track() can contend with tdh_vp_enter().
> > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> 
> The zero-step logic seems to be the most problematic.  E.g. if KVM is trying to
> install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
> a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> trigger the zero-step mitigation if the vCPU that "wins" gets delayed for
> whatever reason.
> 
> Since FROZEN_SPTE is essentially a bit-spinlock with a reaaaaaly slow slow-path,
> what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
> the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
> hits the fault?
> 
> For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> desirable because in many cases, the winning task will install a valid mapping
> before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> instruction is re-executed.  In the happy case, that provides optimal performance
> as KVM doesn't introduce any extra delay/latency.
> 
> But for TDX, the math is different as the cost of re-hitting a fault is much,
> much higher, especially in light of the zero-step issues.
> 
> E.g. if the TDP MMU returns a unique error code for the frozen case, and
> kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> then the TDX EPT violation path can safely retry locally, similar to the do-while
> loop in kvm_tdp_map_page().
> 
> The only part I don't like about this idea is having two "retry" return values,
> which creates the potential for bugs due to checking one but not the other.
> 
> Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
> to the TDX S-EPT fault handler that the SPTE is frozen.  I think I like that
> option better even though the out-param is a bit gross, because it makes it more
> obvious that the "frozen_spte" is a special case that doesn't need attention for
> most paths.
Good idea.
But could we extend it a bit more to allow TDX's EPT violation handler to also
retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY?
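
Something like the following is what I'm picturing for the EPT violation
side.  This is only a minimal sketch, assuming kvm_mmu_page_fault() is
changed to return the raw RET_PF_* code (Idea #1 below) and that both a
frozen SPTE and a BUSY tdh_mem_sept_add()/tdh_mem_page_aug() surface as
RET_PF_RETRY_FROZEN; the function name and error-code plumbing are just
placeholders:

static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
                                    u64 error_code)
{
        int r;

        do {
                /*
                 * Retry "locally": no tdh_vp_enter() in between, so the
                 * zero-step logic never sees repeated faults on this RIP.
                 */
                if (signal_pending(current))
                        return -EINTR;
                cond_resched();
                r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
        } while (r == RET_PF_RETRY_FROZEN);

        return r;
}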

> 
> > - tdg_mem_page_accept() can contend with other tdh_mem*().
> > 
> > Proposal:
> > - Return -EAGAIN directly in ops link_external_spt/set_external_spte when
> >   tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY.
> What is the result of returning -EAGAIN? E.g. does KVM redo tdh_vp_enter()?
Sorry, I originally meant -EBUSY.

With the current code in kvm_tdp_map_page(), the vCPU would just retry
without tdh_vp_enter(), except when there are signals pending.
With a real EPT violation, tdh_vp_enter() would be called again.

I realize that is not good enough.
So, would it be better to return -EAGAIN in the link_external_spt and
set_external_spte ops and have kvm_tdp_mmu_map() return RET_PF_RETRY_FROZEN
(or maybe some other name) for -EAGAIN?
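
Roughly, on the set_external_spte side (the helper names, the BUSY check
and the TDP MMU plumbing below are assumptions, only to illustrate the
idea, not code from this series):

static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
                                     enum pg_level level, kvm_pfn_t pfn)
{
        /* Hypothetical wrapper around the TDH.MEM.PAGE.AUG SEAMCALL. */
        u64 err = tdx_mem_page_aug(kvm, gfn, level, pfn);

        /* Surface a busy S-EPT entry/lock as -EAGAIN instead of looping here. */
        if (tdx_operand_busy(err))      /* hypothetical status check */
                return -EAGAIN;

        return err ? -EIO : 0;
}

/* ...and, roughly, where the TDP MMU installs the external SPTE: */
static int tdp_mmu_set_external_spte(struct kvm *kvm, gfn_t gfn,
                                     enum pg_level level, kvm_pfn_t pfn)
{
        int ret = tdx_sept_set_private_spte(kvm, gfn, level, pfn);

        /* Translate the BUSY case into the "retry locally" return code. */
        return ret == -EAGAIN ? RET_PF_RETRY_FROZEN : ret;
}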

> Also tdh_mem_sept_add() is strictly pre-finalize, correct?  I.e. should never
> contend with tdg_mem_page_accept() because vCPUs can't yet be run.
tdh_mem_page_add() is pre-finalize, tdh_mem_sept_add() is not.
tdh_mem_sept_add() can be called at runtime by tdp_mmu_link_sp().

 
> Similarly, can tdh_mem_page_aug() actually contend with tdg_mem_page_accept()?
> The page isn't yet mapped, so why would the guest be allowed to take a lock on
> the S-EPT entry?
Before tdg_mem_page_accept() accepts a GPA and sets the RWX bits in the
SPTE, a second tdh_mem_page_aug() called on the same GPA could contend with
tdg_mem_page_accept().

But given that KVM does not allow a second tdh_mem_page_aug(), it looks
like the contention between tdh_mem_page_aug() and tdg_mem_page_accept()
will not happen.

> 
> > - Kick off vCPUs at the beginning of page removal path, i.e. before the
> >   tdh_mem_range_block().
> >   Set a flag and disallow tdh_vp_enter() until tdh_mem_page_remove() is done.
> 
> This is easy enough to do via a request, e.g. see KVM_REQ_MCLOCK_INPROGRESS.
Great!
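
For the record, the flow I had in mind (KVM_REQ_TDX_SEPT_REMOVAL is a
made-up request name, and the details would follow whatever the series
ends up doing; this only mirrors the KVM_REQ_MCLOCK_INPROGRESS pattern):

static void tdx_begin_page_removal(struct kvm *kvm)
{
        /*
         * Kick all vCPUs out of the guest.  The request is never handled,
         * so it stays pending and blocks tdh_vp_enter() until cleared.
         */
        kvm_make_all_cpus_request(kvm, KVM_REQ_TDX_SEPT_REMOVAL);
}

static void tdx_end_page_removal(struct kvm *kvm)
{
        struct kvm_vcpu *vcpu;
        unsigned long i;

        /* tdh_mem_page_remove() is done, let vCPUs re-enter the TD. */
        kvm_for_each_vcpu(i, vcpu, kvm)
                kvm_clear_request(KVM_REQ_TDX_SEPT_REMOVAL, vcpu);
}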

> 
> >   (one possible optimization:
> >    since contention from tdh_vp_enter()/tdg_mem_page_accept should be rare,
> >    do not kick off vCPUs in normal conditions.
> >    When SEAMCALL BUSY happens, retry for once, kick off vCPUs and do not allow
> 
> Which SEAMCALL is this specifically?  tdh_mem_range_block()?
Yes, they are:
- tdh_mem_range_block(), which contends with tdh_vp_enter() for secure_ept_lock.
- tdh_mem_track(), which contends with tdh_vp_enter() for the TD epoch
  (the current code in MMU part 2 just retries tdh_mem_track() endlessly).
- tdh_mem_page_remove()/tdh_mem_range_block(), which contend with
  tdg_mem_page_accept() for the SEPT entry lock
  (this one should not happen with a sane guest).

 Resources              SHARED  users              EXCLUSIVE users      
------------------------------------------------------------------------
(5) TDCS epoch         tdh_vp_enter                tdh_mem_track
------------------------------------------------------------------------
(6) secure_ept_lock    tdh_mem_sept_add            tdh_vp_enter
                       tdh_mem_page_aug            tdh_mem_sept_remove
                       tdh_mem_page_remove         tdh_mem_range_block
                                                   tdh_mem_range_unblock
------------------------------------------------------------------------
(7) SEPT entry                                     tdh_mem_sept_add
                                                   tdh_mem_sept_remove
                                                   tdh_mem_page_aug
                                                   tdh_mem_page_remove
                                                   tdh_mem_range_block
                                                   tdh_mem_range_unblock
                                                   tdg_mem_page_accept


> 
> >    TD enter until page removal completes.)
> 
> 
> Idea #1:
> ---
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b45258285c9c..8113c17bd2f6 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4719,7 +4719,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
>                         return -EINTR;
>                 cond_resched();
>                 r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
> -       } while (r == RET_PF_RETRY);
> +       } while (r == RET_PF_RETRY || r == RET_PF_RETRY_FROZEN);
>  
>         if (r < 0)
>                 return r;
> @@ -6129,7 +6129,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
>                 vcpu->stat.pf_spurious++;
>  
>         if (r != RET_PF_EMULATE)
> -               return 1;
> +               return r;
>  
>  emulate:
>         return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn,
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 8d3fb3c8c213..690f03d7daae 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -256,12 +256,15 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>   * and of course kvm_mmu_do_page_fault().
>   *
>   * RET_PF_CONTINUE: So far, so good, keep handling the page fault.
> + * RET_PF_FIXED: The faulting entry has been fixed.
>   * RET_PF_RETRY: let CPU fault again on the address.
> + * RET_PF_RETRY_FROZEN: One or more SPTEs related to the address is frozen.
> + *                     Let the CPU fault again on the address, or retry the
> + *                     fault "locally", i.e. without re-entering the guest.
>   * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
>   * RET_PF_WRITE_PROTECTED: the gfn is write-protected, either unprotected the
>   *                         gfn and retry, or emulate the instruction directly.
>   * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> - * RET_PF_FIXED: The faulting entry has been fixed.
>   * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
>   *
>   * Any names added to this enum should be exported to userspace for use in
> @@ -271,14 +274,18 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>   * on -errno return values.  Somewhat arbitrarily use '0' for CONTINUE, which
>   * will allow for efficient machine code when checking for CONTINUE, e.g.
>   * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero.
> + *
> + * Note #2, RET_PF_FIXED _must_ be '1', so that KVM's -errno/0/1 return code
> + * scheme, where 1==success, translates '1' to RET_PF_FIXED.
>   */
It looks like "r > 0" represents success in vcpu_run()?
If so, moving RET_PF_FIXED to 1 is not necessary?

>  enum {
>         RET_PF_CONTINUE = 0,
> +       RET_PF_FIXED    = 1,
>         RET_PF_RETRY,
> +       RET_PF_RETRY_FROZEN,
>         RET_PF_EMULATE,
>         RET_PF_WRITE_PROTECTED,
>         RET_PF_INVALID,
> -       RET_PF_FIXED,
>         RET_PF_SPURIOUS,
>  };
>  
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 5a475a6456d4..cbf9e46203f3 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1174,6 +1174,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  
>  retry:
>         rcu_read_unlock();
> +       if (ret == RET_PF_RETRY && is_frozen_spte(iter.old_spte))
> +               return RET_PF_RETRY_FROZEN;
>         return ret;
>  }
>  
> ---
> 
> 
> Idea #2:
> ---
>  arch/x86/include/asm/kvm_host.h |  2 +-
>  arch/x86/kvm/mmu/mmu.c          | 12 ++++++------
>  arch/x86/kvm/mmu/mmu_internal.h | 15 ++++++++++++---
>  arch/x86/kvm/mmu/tdp_mmu.c      |  1 +
>  arch/x86/kvm/svm/svm.c          |  2 +-
>  arch/x86/kvm/vmx/vmx.c          |  4 ++--
>  6 files changed, 23 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 46e0a466d7fb..200fecd1de88 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2183,7 +2183,7 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>  int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
>  
>  int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
> -		       void *insn, int insn_len);
> +		       void *insn, int insn_len, bool *frozen_spte);
>  void kvm_mmu_print_sptes(struct kvm_vcpu *vcpu, gpa_t gpa, const char *msg);
>  void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva);
>  void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b45258285c9c..207840a316d3 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4283,7 +4283,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>  		return;
>  
>  	r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, work->arch.error_code,
> -				  true, NULL, NULL);
> +				  true, NULL, NULL, NULL);
>  
>  	/*
>  	 * Account fixed page faults, otherwise they'll never be counted, but
> @@ -4627,7 +4627,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>  		trace_kvm_page_fault(vcpu, fault_address, error_code);
>  
>  		r = kvm_mmu_page_fault(vcpu, fault_address, error_code, insn,
> -				insn_len);
> +				       insn_len, NULL);
>  	} else if (flags & KVM_PV_REASON_PAGE_NOT_PRESENT) {
>  		vcpu->arch.apf.host_apf_flags = 0;
>  		local_irq_disable();
> @@ -4718,7 +4718,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
>  		if (signal_pending(current))
>  			return -EINTR;
>  		cond_resched();
> -		r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
> +		r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level, NULL);
>  	} while (r == RET_PF_RETRY);
>  
>  	if (r < 0)
> @@ -6073,7 +6073,7 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  }
>  
>  int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
> -		       void *insn, int insn_len)
> +				void *insn, int insn_len, bool *frozen_spte)
>  {
>  	int r, emulation_type = EMULTYPE_PF;
>  	bool direct = vcpu->arch.mmu->root_role.direct;
> @@ -6109,7 +6109,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
>  		vcpu->stat.pf_taken++;
>  
>  		r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, error_code, false,
> -					  &emulation_type, NULL);
> +					  &emulation_type, NULL, frozen_spte);
>  		if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
>  			return -EIO;
>  	}
> @@ -6129,7 +6129,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
>  		vcpu->stat.pf_spurious++;
>  
>  	if (r != RET_PF_EMULATE)
> -		return 1;
> +		return r;
>  
>  emulate:
>  	return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn,
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 8d3fb3c8c213..5b1fc77695c1 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -247,6 +247,9 @@ struct kvm_page_fault {
>  	 * is changing its own translation in the guest page tables.
>  	 */
>  	bool write_fault_to_shadow_pgtable;
> +
> +	/* Indicates the page fault needs to be retried due to a frozen SPTE. */
> +	bool frozen_spte;
>  };
>  
>  int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> @@ -256,12 +259,12 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>   * and of course kvm_mmu_do_page_fault().
>   *
>   * RET_PF_CONTINUE: So far, so good, keep handling the page fault.
> + * RET_PF_FIXED: The faulting entry has been fixed.
>   * RET_PF_RETRY: let CPU fault again on the address.
>   * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
>   * RET_PF_WRITE_PROTECTED: the gfn is write-protected, either unprotected the
>   *                         gfn and retry, or emulate the instruction directly.
>   * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> - * RET_PF_FIXED: The faulting entry has been fixed.
>   * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
>   *
>   * Any names added to this enum should be exported to userspace for use in
> @@ -271,14 +274,17 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>   * on -errno return values.  Somewhat arbitrarily use '0' for CONTINUE, which
>   * will allow for efficient machine code when checking for CONTINUE, e.g.
>   * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero.
> + *
> + * Note #2, RET_PF_FIXED _must_ be '1', so that KVM's -errno/0/1 return code
> + * scheme, where 1==success, translates '1' to RET_PF_FIXED.
>   */
>  enum {
>  	RET_PF_CONTINUE = 0,
> +	RET_PF_FIXED    = 1,
>  	RET_PF_RETRY,
>  	RET_PF_EMULATE,
>  	RET_PF_WRITE_PROTECTED,
>  	RET_PF_INVALID,
> -	RET_PF_FIXED,
>  	RET_PF_SPURIOUS,
>  };
>  
> @@ -292,7 +298,8 @@ static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
>  
>  static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  					u64 err, bool prefetch,
> -					int *emulation_type, u8 *level)
> +					int *emulation_type, u8 *level,
> +					bool *frozen_spte)
>  {
>  	struct kvm_page_fault fault = {
>  		.addr = cr2_or_gpa,
> @@ -341,6 +348,8 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  		*emulation_type |= EMULTYPE_WRITE_PF_TO_SP;
>  	if (level)
>  		*level = fault.goal_level;
> +	if (frozen_spte)
> +		*frozen_spte = fault.frozen_spte;
>  
>  	return r;
>  }
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 5a475a6456d4..e7fc5ea4b437 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1174,6 +1174,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  
>  retry:
>  	rcu_read_unlock();
> +	fault->frozen_spte = is_frozen_spte(iter.old_spte);
>  	return ret;
>  }
>  
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 38723b0c435d..269de6a9eb13 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -2075,7 +2075,7 @@ static int npf_interception(struct kvm_vcpu *vcpu)
>  	rc = kvm_mmu_page_fault(vcpu, fault_address, error_code,
>  				static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
>  				svm->vmcb->control.insn_bytes : NULL,
> -				svm->vmcb->control.insn_len);
> +				svm->vmcb->control.insn_len, NULL);
>  
>  	if (rc > 0 && error_code & PFERR_GUEST_RMP_MASK)
>  		sev_handle_rmp_fault(vcpu, fault_address, error_code);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 368acfebd476..fc2ff5d91a71 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5822,7 +5822,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>  	if (unlikely(allow_smaller_maxphyaddr && !kvm_vcpu_is_legal_gpa(vcpu, gpa)))
>  		return kvm_emulate_instruction(vcpu, 0);
>  
> -	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> +	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0, NULL);
>  }
>  
>  static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
> @@ -5843,7 +5843,7 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
>  		return kvm_skip_emulated_instruction(vcpu);
>  	}
>  
> -	return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0);
> +	return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0, NULL);
>  }
>  
>  static int handle_nmi_window(struct kvm_vcpu *vcpu)
> 
> base-commit: bc87a2b4b5508d247ed2c30cd2829969d168adfe
> -- 
> 



