On Fri, 2024-09-06 at 09:13 -0700, Sean Christopherson wrote:
> Leave nested mode before synthesizing shutdown (a.k.a. TRIPLE_FAULT) if
> RSM fails when resuming L2 (a.k.a. guest mode). Architecturally, shutdown
> on RSM occurs _before_ the transition back to guest mode on both Intel and
> AMD.
> 
> On Intel, per the SDM pseudocode, SMRAM state is loaded before critical
> VMX state:
> 
>   restore state normally from SMRAM;
>   ...
>   CR4.VMXE := value stored internally;
>   IF internal storage indicates that the logical processor had been in
>      VMX operation (root or non-root)
>   THEN
>      enter VMX operation (root or non-root);
>      restore VMX-critical state as defined in Section 32.14.1;
>      ...
>      restore current VMCS pointer;
>   FI;
> 
> AMD's APM is both less clearcut and more explicit. Because AMD CPUs save
> VMCB and guest state in SMRAM itself, given the lack of anything in the
> APM to indicate a shutdown in guest mode is possible, a straightforward
> reading of the clause on invalid state is that _what_ state is invalid is
> irrelevant, i.e. all roads lead to shutdown.
> 
>   An RSM causes a processor shutdown if an invalid-state condition is
>   found in the SMRAM state-save area.
> 
> This fixes a bug found by syzkaller where synthesizing shutdown for L2
> led to a nested VM-Exit (if L1 is intercepting shutdown), which in turn
> caused KVM to complain about trying to cancel a nested VM-Enter (see
> commit 759cbd59674a ("KVM: x86: nSVM/nVMX: set nested_run_pending on VM
> entry which is a result of RSM")).
> 
> Note, Paolo pointed out that KVM shouldn't set nested_run_pending until
> after loading SMRAM state. But as above, that's only half the story; KVM
> shouldn't transition to guest mode either. Unfortunately, fixing that
> mess requires rewriting the nVMX and nSVM RSM flows to not piggyback
> their nested VM-Enter flows, as executing the nested VM-Enter flows after
> loading state from SMRAM would clobber much of said state.
> 
> For now, add a FIXME to call out that transitioning to guest mode before
> loading state from SMRAM is wrong.
> 
> Link: https://lore.kernel.org/all/CABgObfYaUHXyRmsmg8UjRomnpQ0Jnaog9-L2gMjsjkqChjDYUQ@xxxxxxxxxxxxxx
> Reported-by: syzbot+988d9efcdf137bc05f66@xxxxxxxxxxxxxxxxxxxxxxxxx
> Closes: https://lore.kernel.org/all/0000000000007a9acb06151e1670@xxxxxxxxxx
> Reported-by: Zheyu Ma <zheyuma97@xxxxxxxxx>
> Closes: https://lore.kernel.org/all/CAMhUBjmXMYsEoVYw_M8hSZjBMHh24i88QYm-RY6HDta5YZ7Wgw@xxxxxxxxxxxxxx
> Analyzed-by: Michal Wilczynski <michal.wilczynski@xxxxxxxxx>
> Cc: Kishen Maloor <kishen.maloor@xxxxxxxxx>
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
>  arch/x86/kvm/smm.c | 24 +++++++++++++++++++-----
>  arch/x86/kvm/x86.c |  6 ------
>  arch/x86/kvm/x86.h |  6 ++++++
>  3 files changed, 25 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
> index 00e3c27d2a87..85241c0c7f56 100644
> --- a/arch/x86/kvm/smm.c
> +++ b/arch/x86/kvm/smm.c
> @@ -624,17 +624,31 @@ int emulator_leave_smm(struct x86_emulate_ctxt *ctxt)
>  #endif
>  
>  	/*
> -	 * Give leave_smm() a chance to make ISA-specific changes to the vCPU
> -	 * state (e.g. enter guest mode) before loading state from the SMM
> -	 * state-save area.
> +	 * FIXME: When resuming L2 (a.k.a. guest mode), the transition to guest
> +	 * mode should happen _after_ loading state from SMRAM. However, KVM
> +	 * piggybacks the nested VM-Enter flows (which is wrong for many other
> +	 * reasons), and so nSVM/nVMX would clobber state that is loaded from
> +	 * SMRAM and from the VMCS/VMCB.
>  	 */
>  	if (kvm_x86_call(leave_smm)(vcpu, &smram))
>  		return X86EMUL_UNHANDLEABLE;
>  
>  #ifdef CONFIG_X86_64
>  	if (guest_cpuid_has(vcpu, X86_FEATURE_LM))
> -		return rsm_load_state_64(ctxt, &smram.smram64);
> +		ret = rsm_load_state_64(ctxt, &smram.smram64);
>  	else
>  #endif
> -		return rsm_load_state_32(ctxt, &smram.smram32);
> +		ret = rsm_load_state_32(ctxt, &smram.smram32);
> +
> +	/*
> +	 * If RSM fails and triggers shutdown, architecturally the shutdown
> +	 * occurs *before* the transition to guest mode. But due to KVM's
> +	 * flawed handling of RSM to L2 (see above), the vCPU may already be
> +	 * in_guest_mode(). Force the vCPU out of guest mode before delivering
> +	 * the shutdown, so that L1 enters shutdown instead of seeing a VM-Exit
> +	 * that architecturally shouldn't be possible.
> +	 */
> +	if (ret != X86EMUL_CONTINUE && is_guest_mode(vcpu))
> +		kvm_leave_nested(vcpu);
> +
> +	return ret;
>  }
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7aecf5b4c148..d00fd0d611bb 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -833,12 +833,6 @@ static void kvm_queue_exception_vmexit(struct kvm_vcpu *vcpu, unsigned int vecto
>  	ex->payload = payload;
>  }
>  
> -/* Forcibly leave the nested mode in cases like a vCPU reset */
> -static void kvm_leave_nested(struct kvm_vcpu *vcpu)
> -{
> -	kvm_x86_ops.nested_ops->leave_nested(vcpu);
> -}
> -
>  static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
>  				   unsigned nr, bool has_error, u32 error_code,
>  				   bool has_payload, unsigned long payload, bool reinject)
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index bab6f9c4a790..a84c48ef5278 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -109,6 +109,12 @@ static inline unsigned int __shrink_ple_window(unsigned int val,
>  void kvm_service_local_tlb_flush_requests(struct kvm_vcpu *vcpu);
>  int kvm_check_nested_events(struct kvm_vcpu *vcpu);
>  
> +/* Forcibly leave the nested mode in cases like a vCPU reset */
> +static inline void kvm_leave_nested(struct kvm_vcpu *vcpu)
> +{
> +	kvm_x86_ops.nested_ops->leave_nested(vcpu);
> +}
> +
>  static inline bool kvm_vcpu_has_run(struct kvm_vcpu *vcpu)
>  {
>  	return vcpu->arch.last_vmentry_cpu != -1;
> 
> base-commit: 12680d7b8ac4db2eba6237a21a93d2b0e78a52a6
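
The ordering here is subtle, so for anyone following along, here is a
purely illustrative userspace model of the failure path (all names are
hypothetical, this is not KVM code) showing why the vCPU must leave
guest mode before the shutdown is synthesized:

	/* Toy model of the RSM-to-L2 failure path; not KVM code. */
	#include <stdbool.h>
	#include <stdio.h>

	enum { EMUL_CONTINUE, EMUL_UNHANDLEABLE };

	struct toy_vcpu {
		bool guest_mode;		/* resuming into L2? */
		bool l1_intercepts_shutdown;
	};

	/* Stands in for rsm_load_state_32/64(); may reject SMRAM state. */
	static int load_smram_state(bool smram_valid)
	{
		return smram_valid ? EMUL_CONTINUE : EMUL_UNHANDLEABLE;
	}

	/*
	 * Stands in for shutdown delivery: while the vCPU is in guest
	 * mode, an intercepted shutdown is reflected to L1 as a nested
	 * VM-Exit instead of shutting down the vCPU.
	 */
	static void synthesize_shutdown(struct toy_vcpu *v)
	{
		if (v->guest_mode && v->l1_intercepts_shutdown)
			puts("nested VM-Exit to L1 (architecturally impossible)");
		else
			puts("vCPU enters shutdown state");
	}

	static void emulate_rsm(struct toy_vcpu *v, bool fixed)
	{
		/* KVM's flawed ordering: enter guest mode first (see FIXME). */
		v->guest_mode = true;

		if (load_smram_state(false) != EMUL_CONTINUE) {
			if (fixed)
				v->guest_mode = false;	/* kvm_leave_nested() */
			synthesize_shutdown(v);
		}
	}

	int main(void)
	{
		struct toy_vcpu v = { .l1_intercepts_shutdown = true };

		printf("before: "); emulate_rsm(&v, false);
		printf("after:  "); emulate_rsm(&v, true);
		return 0;
	}

In the "before" case the shutdown is reflected to L1 as a VM-Exit that
the SDM/APM say cannot occur at that point; in the "after" case the
vCPU (i.e. L1) shuts down directly, matching the architectural ordering
described in the changelog.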
Reviewed-by: Maxim Levitsky <mlevitsk@xxxxxxxxxx>

Best regards,
	Maxim Levitsky