On Fri, Jan 24, 2025 at 05:23:36PM -0800, Sean Christopherson wrote:
> On Mon, Jan 20, 2025, Yan Zhao wrote:
> > On Fri, Jan 17, 2025 at 01:14:02PM -0800, Sean Christopherson wrote:
> > > On Mon, Jan 13, 2025, Yan Zhao wrote:
> > > I don't see any point in adding this comment, if the reader can't follow
> > > the logic of this code, these comments aren't going to help them.  And
> > > the comment about vcpu_run() in particular is misleading, as posted
> > > interrupts aren't truly handled by vcpu_run(), rather they're handled by
> > > hardware (although KVM does send a self-IPI).
> > What about below version?
> >
> > "
> > Bail out the local retry
> > - for pending signal, so that vcpu_run() --> xfer_to_guest_mode_handle_work()
> >   --> kvm_handle_signal_exit() can exit to userspace for signal handling.
>
> Eh, pending signals should be self-explanatory.
Ok.

> > - for pending interrupts, so that tdx_vcpu_enter_exit() --> tdh_vp_enter()
> >   will be re-executed for interrupt injection through posted interrupts.
> > - for pending NMI or KVM_REQ_NMI, so that vcpu_enter_guest() will be
> >   re-executed to process and pend the NMI to the TDX module.  KVM always
> >   regards NMI as allowed, and the TDX module will inject it when NMI is
> >   allowed in the TD.
> > "
> >
> > > > +		 */
> > > > +		if (signal_pending(current) || pi_has_pending_interrupt(vcpu) ||
> > > > +		    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending)
> > >
> > > This needs to check that the IRQ/NMI is actually allowed.  I guess it
> > > doesn't matter for IRQs, but it does matter for NMIs.  Why not use
> > > kvm_vcpu_has_events()?
> > Yes.  However, vt_nmi_allowed() is always true for TDs.
> > For interrupts, tdx_interrupt_allowed() is always true unless the exit
> > reason is EXIT_REASON_HLT.  For the EPT violation handler, the exit reason
> > should not be EXIT_REASON_HLT.
> >
> > > Ah, it's a local function.  At a glance, I don't see any harm in exposing
> > > that to TDX.
> > Besides that kvm_vcpu_has_events() is a local function, the consideration
> > to check "pi_has_pending_interrupt() || kvm_test_request(KVM_REQ_NMI, vcpu)
> > || vcpu->arch.nmi_pending" instead is that
>
> *sigh*
>
>     PEND_NMI TDVPS field is a 1-bit field, i.e. KVM can only pend one NMI
>     in the TDX module.  Also, TDX doesn't allow KVM to request an NMI-window
>     exit directly.  When there is already one NMI pending in the TDX module,
>     i.e. it has not been delivered to the TDX guest yet, if there is an NMI
>     pending in KVM, collapse the pending NMI in KVM into the one pending in
>     the TDX module.  Such a collapse is OK considering that on x86 bare
>     metal, multiple NMIs can collapse into one NMI, e.g. when NMIs are
>     blocked by an SMI.  It's the OS's responsibility to poll all NMI sources
>     in the NMI handler to avoid missing the handling of some NMI events.
>     More details can be found in the changelog of the patch "KVM: TDX:
>     Implement methods to inject NMI".
>
> That's probably fine?  But it's still unfortunate that TDX manages to be
> different at almost every opportunity :-(
:(

> > (1) the two are effectively equivalent for TDs (as nested is not supported
> >     yet)
>
> If they're all equivalent, then *not* open coding is desirable, IMO.  Ah,
> but they aren't equivalent.  tdx_protected_apic_has_interrupt() also checks
> whatever TD_VCPU_STATE_DETAILS_NON_ARCH is.
>
> 	vcpu_state_details =
> 		td_state_non_arch_read64(to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH);
>
> 	return tdx_vcpu_state_details_intr_pending(vcpu_state_details);
Right.  That's why I asked about that in the PUCK.
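
(For context, the whole helper is roughly the below; a paraphrase of the
current series, not verbatim:)

	bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
	{
		u64 vcpu_state_details;

		/* Fast path: a posted interrupt is pending in the PID. */
		if (pi_has_pending_interrupt(vcpu))
			return true;

		/* Slow path: ask the TDX module whether VMXIP is set. */
		vcpu_state_details =
			td_state_non_arch_read64(to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH);

		return tdx_vcpu_state_details_intr_pending(vcpu_state_details);
	}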
The "tdx_vcpu_state_details_intr_pending(vcpu_state_details)" checks if there's a pending interrupt that can be recognized. As in the code snippet of TDX module: case MD_TDVPS_VCPU_STATE_DETAILS_FIELD_CODE: { // Calculate virtual interrupt pending status vcpu_state_t vcpu_state_details; guest_interrupt_status_t interrupt_status; uint64_t interrupt_status_raw; set_vm_vmcs_as_active(md_ctx.tdvps_ptr, 0); ia32_vmread(VMX_GUEST_INTERRUPT_STATUS_ENCODE, &interrupt_status_raw); interrupt_status.raw = (uint16_t)interrupt_status_raw; vcpu_state_details.raw = 0ULL; if ((interrupt_status.rvi & 0xF0UL) > (md_ctx.tdvps_ptr->vapic.apic[PPR_INDEX] & 0xF0UL)) { vcpu_state_details.vmxip = 1ULL; } read_value = vcpu_state_details.raw; break; } My previous consideration is that when there's a pending interrupt that can be recognized, given the current VM Exit reason is EPT violation, the next VM Entry will not deliver the interrupt since the condition to recognize and deliver interrupt is unchanged after the EPT violation VM Exit. So checking pending interrupt brings only false positive, which is unlike checking PID that the vector in the PID could arrive after the EPT violation VM Exit and PID would be cleared after VM Entry even if the interrupts are not deliverable. So checking PID may lead to true positive and less false positive. But I understand your point now. As checking PID can also be false positive, it would be no harm to introduce another source of false positive. So using kvm_vcpu_has_events() looks like a kind of trade-off? kvm_vcpu_has_events() can make TDX's code less special but might lead to the local vCPU more vulnerable to the 0-step mitigation, especially when interrupts are disabled in the guest. > > That code needs a comment, because depending on the behavior of that field, it > might not even be correct. > > > (2) kvm_vcpu_has_events() may lead to unnecessary breaks due to exception > > pending. However, vt_inject_exception() is NULL for TDs. > > Wouldn't a pending exception be a KVM bug? Hmm, yes, it should be. Although kvm_vcpu_ioctl_x86_set_mce() can invoke kvm_queue_exception() to queue an exception for TDs, this should not occur while VCPU_RUN is in progress. > The bigger oddity, which I think is worth calling out, is that because KVM can't > determine if IRQs (or NMIs) are blocked at the time of the EPT violation, false > positives are inevitable. I.e. KVM may re-enter the guest even if the IRQ/NMI > can't be delivered. Call *that* out, and explain why it's fine. Will do. It's fine because: In the worst-case scenario where every break is a false positive, the local vCPU becomes more vulnerable to 0-step mitigation due to the lack of local retries. If the local vCPU triggers a 0-step mitigation in tdh_vp_enter(), the remote vCPUs would either retry while contending for the SEPT tree lock if they are in the page installation path, or retry after waiting for the local vCPU to be kicked off and ensuring no tdh_vp_enter() is occurring in the local vCPU. Moreover, even there's no break out for interrupt injection in the local vCPU, the local vCPU could still suffer 0-step mitigation, e.g. when an RIP faults more than 6 GFNs. 
> > > > +			break;
> > > > +
> > > > +		cond_resched();
> > > > +	}
> > >
> > > Nit, IMO this reads better as:
> > >
> > > 	do {
> > > 		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> > > 	} while (ret == RET_PF_RETRY && local_retry &&
> > > 		 !kvm_vcpu_has_events(vcpu) && !signal_pending(current));
> > >
> > Hmm, the previous way can save one "cond_resched()" for the common cases,
> > i.e., when ret != RET_PF_RETRY or when the GPA is shared.
>
> Hrm, right.  Maybe this?  Dunno if that's any better.
>
> 	ret = 0;
> 	do {
> 		if (ret)
> 			cond_resched();
Not sure either :)
This is clearer; however, it brings one more check to the normal path.

> 		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> 	} while (...)
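
FWIW, a third option that keeps cond_resched() off the common path without
adding a check there would be to keep the current loop shape and only swap in
the combined condition (a sketch only, assuming kvm_vcpu_has_events() gets
exposed to TDX code):

	for (;;) {
		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);

		/*
		 * On the common path (ret != RET_PF_RETRY, or a shared GPA
		 * with local_retry == false), the first clauses short-circuit
		 * and the loop exits without calling cond_resched().
		 */
		if (ret != RET_PF_RETRY || !local_retry ||
		    kvm_vcpu_has_events(vcpu) || signal_pending(current))
			break;

		cond_resched();
	}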