On Tue, Jan 09, 2024 at 08:21:13AM -0800, Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> On Mon, Jan 08, 2024, Chao Gao wrote:
> > On Fri, Jan 05, 2024 at 03:05:12PM -0800, Sean Christopherson wrote:
> > >On Tue, Nov 07, 2023, isaku.yamahata@xxxxxxxxx wrote:
> > >> From: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> > >>
> > >> Wire up TDX PV HLT hypercall to the KVM backend function.
> > >>
> > >> Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> > >> ---
> > >>  arch/x86/kvm/vmx/tdx.c | 42 +++++++++++++++++++++++++++++++++++++++++-
> > >>  arch/x86/kvm/vmx/tdx.h |  3 +++
> > >>  2 files changed, 44 insertions(+), 1 deletion(-)
> > >>
> > >> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > >> index 3a1fe74b95c3..4e48989d364f 100644
> > >> --- a/arch/x86/kvm/vmx/tdx.c
> > >> +++ b/arch/x86/kvm/vmx/tdx.c
> > >> @@ -662,7 +662,32 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> > >>
> > >>  bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
> > >>  {
> > >> -	return pi_has_pending_interrupt(vcpu);
> > >> +	bool ret = pi_has_pending_interrupt(vcpu);
> > >> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> > >> +
> > >> +	if (ret || vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
> > >> +		return true;
> > >> +
> > >> +	if (tdx->interrupt_disabled_hlt)
> > >> +		return false;
> > >> +
> > >> +	/*
> > >> +	 * This is for the case where the virtual interrupt is recognized,
> > >> +	 * i.e. set in vmcs.RVI, between the STI and "HLT".  KVM doesn't have
> > >> +	 * access to RVI and the interrupt is no longer in the PID (because it
> > >> +	 * was "recognized").  It doesn't get delivered in the guest because the
> > >> +	 * TDCALL completes before interrupts are enabled.
> > >> +	 *
> > >> +	 * The TDX module sets RVI while in an STI interrupt shadow.
> > >> +	 * - TDExit (typically TDG.VP.VMCALL<HLT>) from the guest to the TDX
> > >> +	 *   module.  The interrupt shadow at this point is gone.
> > >> +	 * - It knows that there is an interrupt that can be delivered
> > >> +	 *   (RVI > PPR && EFLAGS.IF=1; the other conditions of 29.2.2 don't
> > >> +	 *   matter).
> > >> +	 * - It forwards the TDExit nevertheless, to a clueless hypervisor that
> > >> +	 *   has no way to glean either RVI or PPR.
> > >
> > >WTF.  Seriously, what in the absolute hell is going on.  I reported this
> > >internally four ***YEARS*** ago.  This is not some obscure theoretical edge
> > >case, this is core functionality and it's completely broken garbage.
> > >
> > >NAK.  Hard NAK.  Fix the TDX module, full stop.
> > >
> > >Even worse, TDX 1.5 apparently _already_ has the necessary logic for dealing
> > >with interrupts that are pending in RVI when handling NESTED VM-Enter.
> > >Really!?!?!  Y'all went and added nested virtualization support of some kind,
> > >but can't find the time to get the basics right?
> >
> > We actually fixed the TDX module. See 11.9.5. Pending Virtual Interrupt
> > Delivery Indication in the TDX module 1.5 spec [1].
> >
> > The host VMM can detect whether a virtual interrupt is pending delivery to a
> > VCPU in the Virtual APIC page, using TDH.VP.RD to read the VCPU_STATE_DETAILS
> > TDVPS field.
> >
> > The typical use case is when the guest TD VCPU indicates to the host VMM,
> > using TDG.VP.VMCALL, that it has no work to do and can be halted. The guest
> > TD is expected to pass an “interrupt blocked” flag. The guest TD is expected
> > to set this flag to 0 if and only if RFLAGS.IF is 1 or the TDCALL instruction
> > that invokes TDG.VP.VMCALL immediately follows an STI instruction.
> > If the “interrupt blocked” flag is 0, the host VMM can determine whether to
> > re-schedule the guest TD VCPU based on VCPU_STATE_DETAILS.
> >
> > Isaku, this patch didn't read VCPU_STATE_DETAILS. Maybe you missed something
> > during rebase? Regarding buggy_hlt_workaround, do you aim to avoid reading
> > VCPU_STATE_DETAILS as much as possible (because reading it via SEAMCALL is
> > costly, ~3-4K cycles)?
>
> *sigh*  Why on earth doesn't the TDX module simply compute VMXIP on TDVMCALL?
> It's literally one bit and one extra VMREAD.  There are plenty of register
> bits available, and I highly doubt ~20 cycles in the TDVMCALL path will be
> noticeable, let alone problematic.  Such functionality could even be added on
> top in a TDX module update, and Intel could even bill it as a performance
> optimization.
>
> Eating 4k cycles in the HLT path isn't the end of the world, but it's far
> from optimal and it's just so incredibly wasteful.  I wouldn't be surprised
> if the latency is measurable for certain workloads, which will lead to guests
> using idle=poll and/or other games being played in the guest.
>
> And AFAICT, the TDX module doesn't support HLT passthrough, so fully
> dedicated CPUs can't even mitigate the pain that way.
>
> Anyways, regarding the "workaround", my NAK stands.  It has bad tradeoffs of
> its own, e.g. will result in spurious wakeups, and can't possibly work for
> VMs with passthrough devices.  Not to mention that the implementation has
> several races and false positives.

I'll drop the last part and use TDH.VP.RD(VCPU_STATE_DETAILS) with TDX 1.5.
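
Something along these lines, as a completely untested sketch -- tdh_vp_rd(),
TDVPS_VCPU_STATE_DETAILS, tdx->tdvpr_pa and the exact bit position are
placeholders for whatever the 1.5 metadata plumbing ends up looking like:

/* Assumed layout: the spec's "virtual interrupt pending delivery" bit. */
#define TDX_VCPU_STATE_DETAILS_INTR_PENDING	BIT_ULL(0)

bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
	struct vcpu_tdx *tdx = to_tdx(vcpu);
	u64 details;

	if (pi_has_pending_interrupt(vcpu) ||
	    vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
		return true;

	/* The guest passed "interrupt blocked" = 1; nothing can wake it. */
	if (tdx->interrupt_disabled_hlt)
		return false;

	/*
	 * Ask the TDX module whether a virtual interrupt is already recognized
	 * (i.e. pending in RVI) via TDH.VP.RD(VCPU_STATE_DETAILS).  Costs a
	 * SEAMCALL, but only on the HLT wakeup path.  On any error, err on the
	 * side of a spurious wakeup rather than leaving the vCPU asleep.
	 */
	if (tdh_vp_rd(tdx->tdvpr_pa, TDVPS_VCPU_STATE_DETAILS, &details))
		return true;

	return details & TDX_VCPU_STATE_DETAILS_INTR_PENDING;
}

-- 
Isaku Yamahata <isaku.yamahata@xxxxxxxxxxxxxxx>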