On 11/26/2024 6:51 AM, Sean Christopherson wrote:
[...]
When an NMI happens in non-root, the NMI is acknowledged by the CPU prior to
performing VM-Exit. In regular VMX, NMIs are blocked after such VM-Exits. With
TDX, that blocking happens for SEAM root, but the SEAMRET back to VMX root will
load interruptibility from the SEAMCALL VMCS, and I don't see any code in the
TDX-Module that propagates that blocking to SEAMCALL VMCS.
I see, thanks for the explanation!
Hmm, actually, this means that TDX has a causality inversion, which may become
visible with FRED's NMI source reporting. E.g. NMI X arrives in SEAM non-root
and triggers a VM-Exit. NMI X+1 becomes pending while SEAM root is active.
TDX-Module SEAMRETs to VMX root, NMIs are unblocked, and so NMI X+1 is delivered
and handled before NMI X.
This example can also cause an issue without FRED.
1. NMI X arrives in SEAM non-root and triggers a VM-Exit.
2. NMI X+1 becomes pending while SEAM root is active.
3. TDX-Module SEAMRETs to VMX root, NMIs are unblocked.
4. NMI X+1 is delivered and handled before NMI X.
(NMI handler could handle all NMI source events, including the source
triggered NMI X)
5. KVM calls exc_nmi() to handle the VM Exit caused by NMI X
In step 5, because the source event caused NMI X has been handled, and NMI X
will not be detected as a second half of back-to-back NMIs, according to
Linux NMI handler, it will be considered as an unknown NMI.
Actually, the issue could happen if NMI X+1 occurs after exiting to SEAM root
mode and before KVM handling the VM-Exit caused by NMI X.
So the TDX-Module needs something like this:
diff --git a/src/td_transitions/td_exit.c b/src/td_transitions/td_exit.c
index eecfb2e..b5c17c3 100644
--- a/src/td_transitions/td_exit.c
+++ b/src/td_transitions/td_exit.c
@@ -527,6 +527,11 @@ void td_vmexit_to_vmm(uint8_t vcpu_state, uint8_t last_td_exit, uint64_t scrub_m
load_xmms_by_mask(tdvps_ptr, xmm_select);
}
+ if (<is NMI VM-Exit => SEAMRET)
+ {
+ set_guest_inter_blocking_by_nmi();
+ }
+
// 7. Run the common SEAMRET routine.
tdx_vmm_post_dispatching();
and then KVM should indeed handle NMI exits prior to leaving the noinstr section.
TDX is also different because KVM isn't responsible for context switching guest
state. Specifically, CR2 is managed by the TDX Module, and so there is no window
where KVM runs with guest CR2, and thus there is no risk of clobbering guest CR2
with a host value, e.g. due to take a #PF due instrumentation triggering something.
All that said, I did forget that code that runs between guest_state_enter_irqoff()
and guest_state_exit_irqoff() can't be instrumeneted. And at least as of patch 2
in this series, the simplest way to make that happen is to tag tdx_vcpu_enter_exit()
as noinstr. Just please make sure nothing else is added in the noinstr section
unless it absolutely needs to be there.
If NMI is not a concern, is below also an option?
No, because instrumentation needs to be prohibited for the entire time between
guest_state_enter_irqoff() and guest_state_exit_irqoff().
guest_state_enter_irqoff();
instructmentation_begin();
tdh_vp_enter();
instructmentation_end();
guest_state_exit_irqoff();