Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT

Yan Zhao <yan.y.zhao@xxxxxxxxx> · Thu, 10 Oct 2024 13:23:17 +0800

On Tue, Oct 08, 2024 at 07:51:13AM -0700, Sean Christopherson wrote:
> On Wed, Sep 25, 2024, Yan Zhao wrote:
> > On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote:
> > > On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote:
> > > > On Fri, Sep 13, 2024, Yan Zhao wrote:
> > > > > This is a lock status report of TDX module for current SEAMCALL retry issue
> > > > > based on code in TDX module public repo https://github.com/intel/tdx-module.git
> > > > > branch TDX_1.5.05.
> > > > > 
> > > > > TL;DR:
> > > > > - tdh_mem_track() can contend with tdh_vp_enter().
> > > > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> > > > 
> > > > The zero-step logic seems to be the most problematic.  E.g. if KVM is trying to
> > > > install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
> > > > a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> > > > > trigger the zero-step mitigation if the vCPU that "wins" gets delayed for
> > > > > whatever reason.
> > > > 
> > > > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
> > > > what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
> > > > the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
> > > > hits the fault?
> > > > 
> > > > For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> > > > desirable because in many cases, the winning task will install a valid mapping
> > > > before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> > > > instruction is re-executed.  In the happy case, that provides optimal performance
> > > > as KVM doesn't introduce any extra delay/latency.
> > > > 
> > > > > But for TDX, the math is different as the cost of re-hitting a fault is much,
> > > > much higher, especially in light of the zero-step issues.
> > > > 
> > > > E.g. if the TDP MMU returns a unique error code for the frozen case, and
> > > > kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> > > > then the TDX EPT violation path can safely retry locally, similar to the do-while
> > > > loop in kvm_tdp_map_page().
> > > > 
> > > > The only part I don't like about this idea is having two "retry" return values,
> > > > which creates the potential for bugs due to checking one but not the other.
> > > > 
> > > > Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
> > > > to the TDX S-EPT fault handler that the SPTE is frozen.  I think I like that
> > > > option better even though the out-param is a bit gross, because it makes it more
> > > > obvious that the "frozen_spte" is a special case that doesn't need attention for
> > > > most paths.
> > > Good idea.
> > > But could we extend it a bit more to allow TDX's EPT violation handler to also
> > > retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY?
> > I'm asking this because merely avoiding invoking tdh_vp_enter() in vCPUs seeing
> > FROZEN_SPTE might not be enough to prevent zero step mitigation.
> 
> The goal isn't to make it completely impossible for zero-step to fire, it's to
> make it so that _if_ zero-step fires, KVM can report the error to userspace without
> having to retry, because KVM _knows_ that advancing past the zero-step isn't
> something KVM can solve.
> 
>  : I'm not worried about any performance hit with zero-step, I'm worried about KVM
>  : not being able to differentiate between a KVM bug and guest interference.  The
>  : goal with a local retry is to make it so that KVM _never_ triggers zero-step,
>  : unless there is a bug somewhere.  At that point, if zero-step fires, KVM can
>    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>  : report the error to userspace instead of trying to suppress guest activity, and
>  : potentially from other KVM tasks too.
> 
> In other words, for the selftest you crafted, KVM reporting an error to userspace
> due to zero-step would be working as intended.  
Hmm, but the selftest is just an example to show that 6 consecutive EPT
violations on the same GPA could trigger zero-step.

For an extremely unlucky vCPU, could zero-step still fire when nothing is wrong
in either KVM or QEMU?
E.g.:

1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found.
2nd-6th: try_cmpxchg64() fails on the SPTE at each level (5 levels in total)
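
To make the counting concrete, here is a toy user-space model of the sequence
above. It is only a sketch: the struct, the helper names and the "fire on the
6th consecutive exit" rule are assumptions for illustration (the threshold is
the figure used in this thread), not the TDX module's actual data structures
or logic.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: threshold taken from the "6 consecutive EPT violations" figure. */
#define ZERO_STEP_THRESHOLD 6

/* Hypothetical per-vCPU tracking state, not the TDX module's real layout. */
struct zero_step_tracker {
	uint64_t last_gpa;	/* GPA of the most recent EPT violation */
	unsigned int count;	/* consecutive violations on last_gpa */
};

/* Record one EPT violation exit; return true once the threshold is reached. */
static bool record_ept_violation(struct zero_step_tracker *t, uint64_t gpa)
{
	if (t->count && t->last_gpa == gpa) {
		t->count++;
	} else {
		/* A different GPA (or the first exit) restarts the count. */
		t->last_gpa = gpa;
		t->count = 1;
	}
	return t->count >= ZERO_STEP_THRESHOLD;
}

/*
 * Replay a sequence of faulting GPAs, one per guest re-entry that faulted
 * again.  Returns the 1-based index of the exit that reaches the threshold,
 * or 0 if the threshold is never reached.
 */
static size_t first_exit_hitting_threshold(const uint64_t *gpas, size_t n)
{
	struct zero_step_tracker t = { 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		if (record_ept_violation(&t, gpas[i]))
			return i + 1;
	}
	return 0;
}
```

In this model, one private/shared mismatch plus five try_cmpxchg64() failures
on the same GPA is six entries in gpas[], so the threshold is reached on the
last one even though each individual retry is benign.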

> > E.g. in the selftest below, with a TD configured with pending_ve_disable=N,
> > zero-step mitigation can be triggered on a vCPU that is stuck in EPT violation
> > VM exits more than 6 times (because userspace does not do memslot conversion
> > correctly).
> > 
> > So, if vCPU A wins the chance to call tdh_mem_page_aug(), the SEAMCALL may
> > contend with zero step mitigation code in tdh_vp_enter() in vCPU B stuck
> > in EPT violation vm exits.