Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT

"Huang, Kai" <kai.huang@xxxxxxxxx> · Tue, 17 Sep 2024 14:11:01 +1200

On 14/09/2024 5:23 am, Sean Christopherson wrote:
On Fri, Sep 13, 2024, Yan Zhao wrote:
This is a lock status report of TDX module for current SEAMCALL retry issue
based on code in TDX module public repo https://github.com/intel/tdx-module.git
branch TDX_1.5.05.

TL;DR:
- tdh_mem_track() can contend with tdh_vp_enter().
- tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.

The zero-step logic seems to be the most problematic.  E.g. if KVM is trying to
install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for
whatever reason.

Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
hits the fault?

For non-TDX, resuming the guest and letting the vCPU retry the instruction is
desirable because in many cases, the winning task will install a valid mapping
before KVM can re-run the vCPU, i.e. the fault will be fixed before the
instruction is re-executed.  In the happy case, that provides optimal performance
as KVM doesn't introduce any extra delay/latency.

But for TDX, the math is different as the cost of a re-hitting a fault is much,
much higher, especially in light of the zero-step issues.

E.g. if the TDP MMU returns a unique error code for the frozen case, and
kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
then the TDX EPT violation path can safely retry locally, similar to the do-while
loop in kvm_tdp_map_page().

The only part I don't like about this idea is having two "retry" return values,
which creates the potential for bugs due to checking one but not the other.

Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
to the TDX S-EPT fault handler that the SPTE is frozen.  I think I like that
option better even though the out-param is a bit gross, because it makes it more
obvious that the "frozen_spte" is a special case that doesn't need attention for
most paths.


[...]

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 5a475a6456d4..cbf9e46203f3 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1174,6 +1174,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
  
  retry:
         rcu_read_unlock();
+       if (ret == RET_PF_RETRY && is_frozen_spte(iter.old_spte))
+               return RET_PF_RETRY_FOZEN;

Ack the whole "retry on frozen" approach, either with RET_PF_RETRY_FOZEN 
or fault->frozen_spte.

One minor side effect:

For normal VMs, the fault handler can also see a frozen spte, e.g, when 
kvm_tdp_mmu_map() checks the middle level SPTE:

	/*
         * If SPTE has been frozen by another thread, just give up and
         * retry, avoiding unnecessary page table allocation and free.
         */
        if (is_frozen_spte(iter.old_spte))
        	goto retry;

So for normal VM this RET_PF_RETRY_FOZEN will change "go back to guest 
to retry" to "retry in KVM internally".

As you mentioned above for normal VMs we probably always want to "go 
back to guest to retry" even for FROZEN SPTE, but I guess this is a 
minor issue that we can even notice.

Or we can additionally add:

	if (ret == RET_PF_RETRY && is_frozen_spte(iter.old_spte)
			&& is_mirrored_sptep(iter.sptep))
		return RET_PF_RETRY_FOZEN;

So it only applies to TDX.