On Tue, Jan 09, 2024 at 08:21:13AM -0800, Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> On Mon, Jan 08, 2024, Chao Gao wrote:
> > On Fri, Jan 05, 2024 at 03:05:12PM -0800, Sean Christopherson wrote:
> > >On Tue, Nov 07, 2023, isaku.yamahata@xxxxxxxxx wrote:
> > >> From: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> > >>
> > >> Wire up TDX PV HLT hypercall to the KVM backend function.
> > >>
> > >> Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> > >> ---
> > >>  arch/x86/kvm/vmx/tdx.c | 42 +++++++++++++++++++++++++++++++++++++++++-
> > >>  arch/x86/kvm/vmx/tdx.h |  3 +++
> > >>  2 files changed, 44 insertions(+), 1 deletion(-)
> > >>
> > >> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > >> index 3a1fe74b95c3..4e48989d364f 100644
> > >> --- a/arch/x86/kvm/vmx/tdx.c
> > >> +++ b/arch/x86/kvm/vmx/tdx.c
> > >> @@ -662,7 +662,32 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> > >>
> > >>  bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
> > >>  {
> > >> -	return pi_has_pending_interrupt(vcpu);
> > >> +	bool ret = pi_has_pending_interrupt(vcpu);
> > >> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> > >> +
> > >> +	if (ret || vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
> > >> +		return true;
> > >> +
> > >> +	if (tdx->interrupt_disabled_hlt)
> > >> +		return false;
> > >> +
> > >> +	/*
> > >> +	 * This is for the case where the virtual interrupt is recognized,
> > >> +	 * i.e. set in vmcs.RVI, between the STI and "HLT".  KVM doesn't have
> > >> +	 * access to RVI and the interrupt is no longer in the PID (because it
> > >> +	 * was "recognized").  It doesn't get delivered in the guest because the
> > >> +	 * TDCALL completes before interrupts are enabled.
> > >> +	 *
> > >> +	 * The TDX module sets RVI while in an STI interrupt shadow.
> > >> +	 * - TDExit (typically TDG.VP.VMCALL<HLT>) from the guest to the TDX
> > >> +	 *   module.  The interrupt shadow at this point is gone.
> > >> +	 * - It knows that there is an interrupt that can be delivered
> > >> +	 *   (RVI > PPR && EFLAGS.IF=1; the other conditions of 29.2.2 don't
> > >> +	 *   matter).
> > >> +	 * - It forwards the TDExit nevertheless, to a clueless hypervisor that
> > >> +	 *   has no way to glean either RVI or PPR.
> > >
> > >WTF.  Seriously, what in the absolute hell is going on.  I reported this
> > >internally four ***YEARS*** ago.  This is not some obscure theoretical edge
> > >case, this is core functionality and it's completely broken garbage.
> > >
> > >NAK.  Hard NAK.  Fix the TDX module, full stop.
> > >
> > >Even worse, TDX 1.5 apparently _already_ has the necessary logic for dealing
> > >with interrupts that are pending in RVI when handling NESTED VM-Enter.
> > >Really!?!?!  Y'all went and added nested virtualization support of some kind,
> > >but can't find the time to get the basics right?
> >
> > We actually fixed the TDX module. See 11.9.5. Pending Virtual Interrupt
> > Delivery Indication in the TDX module 1.5 spec [1].
> >
> > The host VMM can detect whether a virtual interrupt is pending delivery to a
> > VCPU in the Virtual APIC page, using TDH.VP.RD to read the VCPU_STATE_DETAILS
> > TDVPS field.
> >
> > The typical use case is when the guest TD VCPU indicates to the host VMM,
> > using TDG.VP.VMCALL, that it has no work to do and can be halted. The guest
> > TD is expected to pass an “interrupt blocked” flag. The guest TD is expected
> > to set this flag to 0 if and only if RFLAGS.IF is 1 or the TDCALL instruction
> > that invokes TDG.VP.VMCALL immediately follows an STI instruction.
> > If the “interrupt blocked” flag is 0, the host VMM can determine whether to
> > re-schedule the guest TD VCPU based on VCPU_STATE_DETAILS.
> >
> > Isaku, this patch didn't read VCPU_STATE_DETAILS. Maybe you missed something
> > during rebase? Regarding buggy_hlt_workaround, do you aim to avoid reading
> > VCPU_STATE_DETAILS as much as possible (because reading it via SEAMCALL is
> > costly, ~3-4K cycles)?
>
> *sigh*  Why on earth doesn't the TDX module simply compute VMXIP on TDVMCALL?
> It's literally one bit and one extra VMREAD.  There are plenty of register
> bits available, and I highly doubt ~20 cycles in the TDVMCALL path will be
> noticeable, let alone problematic.  Such functionality could even be added on
> top in a TDX module update, and Intel could even bill it as a performance
> optimization.
>
> Eating 4k cycles in the HLT path isn't the end of the world, but it's far
> from optimal and it's just so incredibly wasteful.  I wouldn't be surprised
> if the latency is measurable for certain workloads, which will lead to guests
> using idle=poll and/or other games being played in the guest.
>
> And AFAICT, the TDX module doesn't support HLT passthrough, so fully
> dedicated CPUs can't even mitigate the pain that way.
>
> Anyways, regarding the "workaround", my NAK stands.  It has bad tradeoffs of
> its own, e.g. will result in spurious wakeups, and can't possibly work for
> VMs with passthrough devices.  Not to mention that the implementation has
> several races and false positives.

I'll drop the last part and use TDH.VP.RD(VCPU_STATE_DETAILS) with TDX 1.5.
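
Something along these lines, as a completely untested sketch -- tdh_vp_rd(),
TDVPS_VCPU_STATE_DETAILS, tdx->tdvpr_pa and the exact bit position are
placeholders for whatever the 1.5 metadata plumbing ends up looking like:

/* Assumed layout: the spec's "virtual interrupt pending delivery" bit. */
#define TDX_VCPU_STATE_DETAILS_INTR_PENDING	BIT_ULL(0)

bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
	struct vcpu_tdx *tdx = to_tdx(vcpu);
	u64 details;

	if (pi_has_pending_interrupt(vcpu) ||
	    vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
		return true;

	/* The guest passed "interrupt blocked" = 1; nothing can wake it. */
	if (tdx->interrupt_disabled_hlt)
		return false;

	/*
	 * Ask the TDX module whether a virtual interrupt is already recognized
	 * (i.e. pending in RVI) via TDH.VP.RD(VCPU_STATE_DETAILS).  Costs a
	 * SEAMCALL, but only on the HLT wakeup path.  On any error, err on the
	 * side of a spurious wakeup rather than leaving the vCPU asleep.
	 */
	if (tdh_vp_rd(tdx->tdvpr_pa, TDVPS_VCPU_STATE_DETAILS, &details))
		return true;

	return details & TDX_VCPU_STATE_DETAILS_INTR_PENDING;
}

-- 
Isaku Yamahata <isaku.yamahata@xxxxxxxxxxxxxxx>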