On Thu, Oct 10, 2024 at 10:33:30AM -0700, Sean Christopherson wrote:
> On Thu, Oct 10, 2024, Yan Zhao wrote:
> > On Tue, Oct 08, 2024 at 07:51:13AM -0700, Sean Christopherson wrote:
> > > On Wed, Sep 25, 2024, Yan Zhao wrote:
> > > > On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote:
> > > > > On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote:
> > > > > > On Fri, Sep 13, 2024, Yan Zhao wrote:
> > > > > > > This is a lock status report of the TDX module for the current SEAMCALL retry issue,
> > > > > > > based on code in the TDX module public repo https://github.com/intel/tdx-module.git,
> > > > > > > branch TDX_1.5.05.
> > > > > > >
> > > > > > > TL;DR:
> > > > > > > - tdh_mem_track() can contend with tdh_vp_enter().
> > > > > > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> > > > > >
> > > > > > The zero-step logic seems to be the most problematic. E.g. if KVM is trying to
> > > > > > install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
> > > > > > a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> > > > > > trigger the zero-step mitigation if the vCPU that "wins" gets delayed for
> > > > > > whatever reason.
> > > > > >
> > > > > > Since FROZEN_SPTE is essentially a bit-spinlock with a reaaaaaly slow slow-path,
> > > > > > what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
> > > > > > the fault "locally", i.e. _without_ redoing tdh_vp_enter(), to see if the vCPU still
> > > > > > hits the fault?
> > > > > >
> > > > > > For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> > > > > > desirable because in many cases, the winning task will install a valid mapping
> > > > > > before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> > > > > > instruction is re-executed. In the happy case, that provides optimal performance
> > > > > > as KVM doesn't introduce any extra delay/latency.
> > > > > >
> > > > > > But for TDX, the math is different as the cost of re-hitting a fault is much,
> > > > > > much higher, especially in light of the zero-step issues.
> > > > > >
> > > > > > E.g. if the TDP MMU returns a unique error code for the frozen case, and
> > > > > > kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> > > > > > then the TDX EPT violation path can safely retry locally, similar to the do-while
> > > > > > loop in kvm_tdp_map_page().
> > > > > >
> > > > > > The only part I don't like about this idea is having two "retry" return values,
> > > > > > which creates the potential for bugs due to checking one but not the other.
> > > > > >
> > > > > > Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
> > > > > > to the TDX S-EPT fault handler that the SPTE is frozen. I think I like that
> > > > > > option better even though the out-param is a bit gross, because it makes it more
> > > > > > obvious that the "frozen_spte" is a special case that doesn't need attention for
> > > > > > most paths.
> > > > >
> > > > > Good idea.
> > > > > But could we extend it a bit more to allow TDX's EPT violation handler to also
> > > > > retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY?
> > > >
> > > > I'm asking this because merely avoiding invoking tdh_vp_enter() in vCPUs seeing
> > > > FROZEN_SPTE might not be enough to prevent the zero-step mitigation.
> > >
> > > The goal isn't to make it completely impossible for zero-step to fire, it's to
> > > make it so that _if_ zero-step fires, KVM can report the error to userspace without
> > > having to retry, because KVM _knows_ that advancing past the zero-step isn't
> > > something KVM can solve.
> > >
> > > : I'm not worried about any performance hit with zero-step, I'm worried about KVM
> > > : not being able to differentiate between a KVM bug and guest interference. The
> > > : goal with a local retry is to make it so that KVM _never_ triggers zero-step,
> > > : unless there is a bug somewhere. At that point, if zero-step fires, KVM can
> > >                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > : report the error to userspace instead of trying to suppress guest activity, and
> > > : potentially from other KVM tasks too.
> > >
> > > In other words, for the selftest you crafted, KVM reporting an error to userspace
> > > due to zero-step would be working as intended.
> >
> > Hmm, but the selftest is an example to show that 6 consecutive EPT violations on
> > the same GPA could trigger zero-step.
> >
> > For an extremely unlucky vCPU, is it still possible to trigger zero-step when
> > nothing is wrong in either KVM or QEMU?
> > e.g.
> >
> > 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found.
> > 2nd-6th: try_cmpxchg64() fails on the SPTE at each level (5 levels in total)
>
> Very technically, this shouldn't be possible. The only way for there to be
> contention on the leaf SPTE is if some other KVM task installed a SPTE, i.e. the
> 6th attempt should succeed, even if the faulting vCPU wasn't the one to create
> the SPTE.

Hmm, a 7th EPT violation could still occur if the vCPU that sees the
try_cmpxchg64() failure returns to the guest faster than the one that
successfully installs the SPTE.

> That said, a few thoughts:
>
> 1. Where did we end up on the idea of requiring userspace to pre-fault memory?

I didn't follow this question. Do you want to disallow userspace from
pre-faulting memory after TD finalization, or do you want to suggest that
userspace do it?

> 2. The zero-step logic really should have a slightly more conservative threshold.
>    I have a hard time believing that e.g. 10 attempts would create a side channel,
>    but 6 attempts is "fine".

I don't know where the value 6 comes from. :) We may need to ask.

> 3. This would be a good reason to implement a local retry in kvm_tdp_mmu_map().
>    Yes, I'm being somewhat hypocritical since I'm so against retrying for the
>    S-EPT case, but my objection to retrying for S-EPT is that it _should_ be easy
>    for KVM to guarantee success.

That's reasonable. But the TDX code still needs to retry on RET_PF_RETRY_FROZEN
without re-entering the guest. Would it be good for the TDX code to retry
whenever it sees RET_PF_RETRY or RET_PF_RETRY_FROZEN? We could have
tdx_sept_link_private_spt()/tdx_sept_set_private_spte() return -EBUSY on
contention. (A rough sketch of what I have in mind is below, after your patch.)

>
> E.g. for #3, the below (compile tested only) patch should make it impossible for
> the S-EPT case to fail, as dirty logging isn't (yet) supported and mirror SPTEs
> should never trigger A/D assists, i.e. retry should always succeed.
>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 47 ++++++++++++++++++++++++++++++++------
>  1 file changed, 40 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 3b996c1fdaab..e47573a652a9 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1097,6 +1097,18 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
>  static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
>                                     struct kvm_mmu_page *sp, bool shared);
>
> +static struct kvm_mmu_page *tdp_mmu_realloc_sp(struct kvm_vcpu *vcpu,
> +                                               struct kvm_mmu_page *sp)
> +{
> +        if (!sp)
> +                return tdp_mmu_alloc_sp(vcpu);
> +
> +        memset(sp, 0, sizeof(*sp));
> +        memset64(sp->spt, vcpu->arch.mmu_shadow_page_cache.init_value,
> +                 PAGE_SIZE / sizeof(u64));
> +        return sp;
> +}
> +
>  /*
>   * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
>   * page tables and SPTEs to translate the faulting guest physical address.
> @@ -1104,9 +1116,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
>  int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>          struct kvm_mmu *mmu = vcpu->arch.mmu;
> +        struct kvm_mmu_page *sp = NULL;
>          struct kvm *kvm = vcpu->kvm;
>          struct tdp_iter iter;
> -        struct kvm_mmu_page *sp;
>          int ret = RET_PF_RETRY;
>
>          kvm_mmu_hugepage_adjust(vcpu, fault);
> @@ -1116,8 +1128,16 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>          rcu_read_lock();
>
>          tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
> -                int r;
> -
> +                /*
> +                 * Somewhat arbitrarily allow two local retries, e.g. to play
> +                 * nice with the extremely unlikely case that KVM encounters a
> +                 * huge SPTE, an Access-assist _and_ a subsequent Dirty-assist.
> +                 * Retrying is inexpensive, but if KVM fails to install a SPTE
> +                 * three times, then a fourth attempt is likely futile and it's
> +                 * time to back off.
> +                 */
> +                int r, retry_locally = 2;
> +again:
>                  if (fault->nx_huge_page_workaround_enabled)
>                          disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
>
> @@ -1140,7 +1160,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>                   * The SPTE is either non-present or points to a huge page that
>                   * needs to be split.
>                   */
> -                sp = tdp_mmu_alloc_sp(vcpu);
> +                sp = tdp_mmu_realloc_sp(vcpu, sp);
>                  tdp_mmu_init_child_sp(sp, &iter);
>
>                  sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
>
> @@ -1151,11 +1171,16 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>                          r = tdp_mmu_link_sp(kvm, &iter, sp, true);
>
>                  /*
> -                 * Force the guest to retry if installing an upper level SPTE
> -                 * failed, e.g. because a different task modified the SPTE.
> +                 * If installing an upper level SPTE failed, retry the walk
> +                 * locally before forcing the guest to retry. If the SPTE was
> +                 * modified by a different task, odds are very good the new
> +                 * SPTE is usable as-is. And if the SPTE was modified by the
> +                 * CPU, e.g. to set A/D bits, then unless KVM gets *extremely*
> +                 * unlucky, the CMPXCHG should succeed the second time around.
>                   */
>                  if (r) {
> -                        tdp_mmu_free_sp(sp);
> +                        if (retry_locally--)
> +                                goto again;
>                          goto retry;
>                  }
>
> @@ -1166,6 +1191,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>                          track_possible_nx_huge_page(kvm, sp);
>                          spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>                  }
> +                sp = NULL;
>          }
>
>          /*
> @@ -1180,6 +1206,13 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>
> retry:
>          rcu_read_unlock();
> +
> +        /*
> +         * Free the previously allocated MMU page if KVM retried locally and
> +         * ended up not using said page.
> +         */
> +        if (sp)
> +                tdp_mmu_free_sp(sp);
>          return ret;
> }
>
>
> base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b
> --
>
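
For the TDX side, the local retry I have in mind is roughly the below. It's an
untested sketch: it assumes the RET_PF_RETRY_FROZEN return value discussed
above, assumes kvm_mmu_page_fault() is modified to return the raw RET_PF_*
code instead of '1', and the handler name is just a placeholder. The loop is
modeled on the do-while in kvm_tdp_map_page():

static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
                                    u64 error_code)
{
        int ret;

        do {
                /*
                 * Bail if a signal is pending so the vCPU can't get stuck
                 * here; the fault will be retried after the signal is
                 * serviced.
                 */
                if (signal_pending(current))
                        return -EINTR;
                cond_resched();

                ret = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);

                /*
                 * Loop, i.e. retry without re-doing tdh_vp_enter(), when the
                 * fault hit a FROZEN_SPTE. Re-entering the guest would
                 * re-fault on the same GPA and count toward the TDX module's
                 * zero-step mitigation; spinning until the winning task
                 * unfreezes the SPTE does not.
                 */
        } while (ret == RET_PF_RETRY_FROZEN);

        return ret;
}

If tdx_sept_link_private_spt()/tdx_sept_set_private_spte() return -EBUSY on
contention, the same loop could also swallow RET_PF_RETRY from the S-EPT
callbacks, though per your objection above, that case should arguably be
treated as a KVM bug instead.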