Syzkaller found a warning triggered in nested_vmx_vmexit(). vmx->nested.nested_run_pending is non-zero, even though we're in nested_vmx_vmexit(). Generally, trying to cancel a pending entry is considered a bug. However in this particular scenario, the kernel behavior seems correct. Syzkaller scenario: 1) Set up VCPU's 2) Run some code with KVM_RUN in L2 as a nested guest 3) Return from KVM_RUN 4) Inject KVM_SMI into the VCPU 5) Change the EFER register with KVM_SET_SREGS to value 0x2501 6) Run some code on the VCPU using KVM_RUN 7) Observe following behavior: kvm_smm_transition: vcpu 0: entering SMM, smbase 0x30000 kvm_entry: vcpu 0, rip 0x8000 kvm_entry: vcpu 0, rip 0x8000 kvm_entry: vcpu 0, rip 0x8002 kvm_smm_transition: vcpu 0: leaving SMM, smbase 0x30000 kvm_nested_vmenter: rip: 0x0000000000008002 vmcs: 0x0000000000007000 nested_rip: 0x0000000000000000 int_ctl: 0x00000000 event_inj: 0x00000000 nested_ept=n guest cr3: 0x0000000000002000 kvm_nested_vmexit_inject: reason: TRIPLE_FAULT ext_inf1: 0x0000000000000000 ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000 What happened here is an SMI was injected immediately and the handler was called at address 0x8000; all is good. Later, an RSM instruction is executed in an emulator to return to the nested VM. em_rsm() is called, which leads to emulator_leave_smm(). A part of this function calls VMX/SVM callback, in this case vmx_leave_smm(). It attempts to set up a pending reentry to guest VM by calling nested_vmx_enter_non_root_mode() and sets vmx->nested.nested_run_pending to one. Unfortunately, later in emulator_leave_smm(), rsm_load_state_64() fails to write invalid EFER to the MSR. This results in em_rsm() calling triple_fault callback. At this point it's clear that the KVM should call the vmexit, but vmx->nested.nested_run_pending is left set to 1. Similar flow goes for SVM, as the bug also reproduces on AMD platforms. To address this issue, reset the nested_run_pending flag in the triple_fault handler. However, it's crucial to note that nested_pending_run cannot be cleared in all cases. It should only be cleared for the specific instruction requiring hardware VM-Enter to complete the emulation, such as RSM. Previously, there were instances where KVM prematurely synthesized a triple fault on nested VM-Enter. In these cases, it is not appropriate to zero the nested_pending_run. To resolve this, introduce a new emulator flag indicating the need for HW VM-Enter to complete emulating RSM. Based on this flag, a decision can be made in vendor-specific triple fault handlers about whether nested_pending_run needs to be cleared. Fixes: 759cbd59674a ("KVM: x86: nSVM/nVMX: set nested_run_pending on VM entry which is a result of RSM") Reported-by: Zheyu Ma <zheyuma97@xxxxxxxxx> Closes: https://lore.kernel.org/all/CAMhUBjmXMYsEoVYw_M8hSZjBMHh24i88QYm-RY6HDta5YZ7Wgw@xxxxxxxxxxxxxx Signed-off-by: Michal Wilczynski <michal.wilczynski@xxxxxxxxx> --- v2: - added new emulator flags indicating whether an instruction needs a VM-Enter to complete emulation (Sean) - fix in SVM nested triple_fault handler (Sean) - only clear nested_run_pending on RSM instruction (Sean) arch/x86/kvm/emulate.c | 4 +++- arch/x86/kvm/kvm_emulate.h | 2 ++ arch/x86/kvm/svm/nested.c | 9 +++++++++ arch/x86/kvm/vmx/nested.c | 12 ++++++++++++ 4 files changed, 26 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index e223043ef5b2..889460432eac 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -178,6 +178,7 @@ #define IncSP ((u64)1 << 54) /* SP is incremented before ModRM calc */ #define TwoMemOp ((u64)1 << 55) /* Instruction has two memory operand */ #define IsBranch ((u64)1 << 56) /* Instruction is considered a branch. */ +#define NeedVMEnter ((u64)1 << 57) /* Instruction needs HW VM-Enter to complete */ #define DstXacc (DstAccLo | SrcAccHi | SrcWrite) @@ -4462,7 +4463,7 @@ static const struct opcode twobyte_table[256] = { F(DstMem | SrcReg | Src2CL | ModRM, em_shld), N, N, /* 0xA8 - 0xAF */ I(Stack | Src2GS, em_push_sreg), I(Stack | Src2GS, em_pop_sreg), - II(EmulateOnUD | ImplicitOps, em_rsm, rsm), + II(EmulateOnUD | ImplicitOps | NeedVMEnter, em_rsm, rsm), F(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_bts), F(DstMem | SrcReg | Src2ImmByte | ModRM, em_shrd), F(DstMem | SrcReg | Src2CL | ModRM, em_shrd), @@ -4966,6 +4967,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int } ctxt->is_branch = opcode.flags & IsBranch; + ctxt->need_vm_enter = opcode.flags & NeedVMEnter; /* Unrecognised? */ if (ctxt->d == 0) diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h index e6d149825169..1e1366afa51d 100644 --- a/arch/x86/kvm/kvm_emulate.h +++ b/arch/x86/kvm/kvm_emulate.h @@ -372,6 +372,8 @@ struct x86_emulate_ctxt { struct read_cache io_read; struct read_cache mem_read; bool is_branch; + /* instruction need a HW VM-Enter to complete correctly */ + bool need_vm_enter; }; #define KVM_EMULATOR_BUG_ON(cond, ctxt) \ diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index dee62362a360..8c19ad5e18d4 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1165,11 +1165,20 @@ int nested_svm_vmexit(struct vcpu_svm *svm) static void nested_svm_triple_fault(struct kvm_vcpu *vcpu) { + struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt; struct vcpu_svm *svm = to_svm(vcpu); if (!vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_SHUTDOWN)) return; + /* + * In case of a triple fault, cancel the nested reentry. This may occur + * when the RSM instruction fails while attempting to restore the state + * from SMRAM. + */ + if (ctxt->need_vm_enter) + svm->nested.nested_run_pending = 0; + kvm_clear_request(KVM_REQ_TRIPLE_FAULT, vcpu); nested_svm_simple_vmexit(to_svm(vcpu), SVM_EXIT_SHUTDOWN); } diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 6329a306856b..9228699b4c1e 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -4950,7 +4950,19 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason, static void nested_vmx_triple_fault(struct kvm_vcpu *vcpu) { + struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt; + struct vcpu_vmx *vmx = to_vmx(vcpu); + kvm_clear_request(KVM_REQ_TRIPLE_FAULT, vcpu); + + /* + * In case of a triple fault, cancel the nested reentry. This may occur + * when the RSM instruction fails while attempting to restore the state + * from SMRAM. + */ + if (ctxt->need_vm_enter) + vmx->nested.nested_run_pending = 0; + nested_vmx_vmexit(vcpu, EXIT_REASON_TRIPLE_FAULT, 0, 0); } -- 2.41.0