On Mon, Jan 20, 2025, Yan Zhao wrote:
> On Fri, Jan 17, 2025 at 01:14:02PM -0800, Sean Christopherson wrote:
> > On Mon, Jan 13, 2025, Yan Zhao wrote:
> > I don't see any point in adding this comment, if the reader can't follow
> > the logic of this code, these comments aren't going to help them. And the
> > comment about vcpu_run() in particular is misleading, as posted interrupts
> > aren't truly handled by vcpu_run(), rather they're handled by hardware
> > (although KVM does send a self-IPI).
> What about below version?
>
> "
> Bail out the local retry
> - for pending signal, so that vcpu_run() --> xfer_to_guest_mode_handle_work()
>   --> kvm_handle_signal_exit() can exit to userspace for signal handling.

Eh, pending signals should be self-explanatory.

> - for pending interrupts, so that tdx_vcpu_enter_exit() --> tdh_vp_enter()
>   will be re-executed for interrupt injection through posted interrupt.
> - for pending nmi or KVM_REQ_NMI, so that vcpu_enter_guest() will be
>   re-executed to process and pend NMI to the TDX module. KVM always regards
>   NMI as allowed and the TDX module will inject it when NMI is allowed in
>   the TD.
> "
>
> > > +	 */
> > > +	if (signal_pending(current) || pi_has_pending_interrupt(vcpu) ||
> > > +	    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending)
> >
> > This needs to check that the IRQ/NMI is actually allowed. I guess it
> > doesn't matter for IRQs, but it does matter for NMIs. Why not use
> > kvm_vcpu_has_events()?
> Yes. However, vt_nmi_allowed() is always true for TDs.
> For interrupts, tdx_interrupt_allowed() is always true unless the exit
> reason is EXIT_REASON_HLT. For the EPT violation handler, the exit reason
> should not be EXIT_REASON_HLT.
>
> > Ah, it's a local function. At a glance, I don't see any harm in exposing
> > that to TDX.
> Besides that kvm_vcpu_has_events() is a local function, the consideration to
> check "pi_has_pending_interrupt() || kvm_test_request(KVM_REQ_NMI, vcpu) ||
> vcpu->arch.nmi_pending" instead is that

*sigh*

The PEND_NMI TDVPS field is a 1-bit field, i.e. KVM can only pend one NMI in
the TDX module. Also, TDX doesn't allow KVM to request an NMI-window exit
directly. When there is already one NMI pending in the TDX module, i.e. it
has not been delivered to the TDX guest yet, and there is another NMI pending
in KVM, collapse the pending NMI in KVM into the one pending in the TDX
module. Such a collapse is OK considering that on x86 bare metal, multiple
NMIs can collapse into one NMI, e.g. when NMIs are blocked by an SMI. It's
the OS's responsibility to poll all NMI sources in the NMI handler to avoid
missing the handling of some NMI events.

More details can be found in the changelog of the patch "KVM: TDX: Implement
methods to inject NMI".

That's probably fine? But it's still unfortunate that TDX manages to be
different at almost every opportunity :-(

> (1) the two are effectively equivalent for TDs (as nested is not supported yet)

If they're all equivalent, then *not* open coding is desirable, IMO.

Ah, but they aren't equivalent. tdx_protected_apic_has_interrupt() also checks
whatever TD_VCPU_STATE_DETAILS_NON_ARCH is.

	vcpu_state_details =
		td_state_non_arch_read64(to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH);

	return tdx_vcpu_state_details_intr_pending(vcpu_state_details);

That code needs a comment, because depending on the behavior of that field, it
might not even be correct.

> (2) kvm_vcpu_has_events() may lead to unnecessary breaks due to exception
>     pending. However, vt_inject_exception() is NULL for TDs.

Wouldn't a pending exception be a KVM bug?
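To make the discussion easier to follow, here's roughly how the check under
discussion fits into the local retry loop, with the rationale from this
thread folded into a comment. This is only a sketch reconstructed from the
fragments quoted here, so the loop structure and the local_retry handling may
not match the actual patch:

	while (1) {
		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
		if (ret != RET_PF_RETRY || !local_retry)
			break;

		/*
		 * Bail out of the local retry for anything that must be
		 * handled outside of this loop:
		 *
		 *  - a pending signal, so the exit to userspace for signal
		 *    handling isn't delayed;
		 *  - a pending posted interrupt, so that tdh_vp_enter() is
		 *    re-executed and the interrupt is delivered (by hardware)
		 *    via posted interrupts;
		 *  - a pending NMI or KVM_REQ_NMI, so that vcpu_enter_guest()
		 *    can pend the NMI to the TDX module (PEND_NMI is a 1-bit
		 *    field, and TDX doesn't allow requesting an NMI-window
		 *    exit).
		 */
		if (signal_pending(current) || pi_has_pending_interrupt(vcpu) ||
		    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending)
			break;

		cond_resched();
	}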
The bigger oddity, which I think is worth calling out, is that because KVM
can't determine if IRQs (or NMIs) are blocked at the time of the EPT
violation, false positives are inevitable. I.e. KVM may re-enter the guest
even if the IRQ/NMI can't be delivered. Call *that* out, and explain why it's
fine.

> > > +			break;
> > > +
> > > +		cond_resched();
> > > +	}
> >
> > Nit, IMO this reads better as:
> >
> > 	do {
> > 		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> > 	} while (ret == RET_PF_RETRY && local_retry &&
> > 		 !kvm_vcpu_has_events(vcpu) && !signal_pending(current));
> >
> Hmm, the previous way can save one "cond_resched()" for the common cases,
> i.e. when ret != RET_PF_RETRY or when gpa is shared.

Hrm, right. Maybe this?

	ret = 0;
	do {
		if (ret)
			cond_resched();

		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
	} while (...)

Dunno if that's any better.
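Spelling that out with the condition from the earlier do-while suggestion,
purely as a sketch (and assuming kvm_vcpu_has_events() does get exposed to
TDX, which is still an open question above):

	/*
	 * Skip cond_resched() on the first iteration so the common cases,
	 * i.e. the first attempt not returning RET_PF_RETRY or local retry
	 * not applying (shared GPA), pay nothing extra.
	 */
	ret = 0;
	do {
		if (ret)
			cond_resched();

		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
	} while (ret == RET_PF_RETRY && local_retry &&
		 !kvm_vcpu_has_events(vcpu) && !signal_pending(current));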