On Thu, 2022-03-10 at 22:38 +0100, Maciej S. Szmigiero wrote: > From: "Maciej S. Szmigiero" <maciej.szmigiero@xxxxxxxxxx> > > In SVM synthetic software interrupts or INT3 or INTO exception that L1 > wants to inject into its L2 guest are forgotten if there is an intervening > L0 VMEXIT during their delivery. > > They are re-injected correctly with VMX, however. > > This is because there is an assumption in SVM that such exceptions will be > re-delivered by simply re-executing the current instruction. > Which might not be true if this is a synthetic exception injected by L1, > since in this case the re-executed instruction will be one already in L2, > not the VMRUN instruction in L1 that attempted the injection. > > Leave the pending L1 -> L2 event in svm->nested.ctl.event_inj{,err} until > it is either re-injected successfully or returned to L1 upon a nested > VMEXIT. > Make sure to always re-queue such event if returned in EXITINTINFO. > > The handling of L0 -> {L1, L2} event re-injection is left as-is to avoid > unforeseen regressions. Some time ago I noticed this too, but haven't dug into this too much. I rememeber I even had some half-baked patch for this I never posted, because I didn't think about the possibility of this syntetic injection. Just to be clear that I understand this correctly: 1. What is happening is that L1 is injecting INTn/INTO/INT3 but L2 code doesn't actualy contain an INTn/INTO/INT3 instruction. This is wierd but legal thing to do. Again, if L2 actually contained the instruction, it would have worked? 2. When actual INTn/INT0/INT3 are intercepted on SVM, then save.RIP points to address of the instruction, and control.next_rip points to address of next instruction after (as expected) 3. When EVENTINJ is used with '(TYPE = 3) with vectors 3 or 4' or 'TYPE=4', then next_rip is pushed on the stack, while save.RIP is pretty much ignored, and exectution jumps to the handler in the IDT. also at least for INT3/INTO, PRM states that IDT's DPL field is checked before dispatch, meaning that we can get legit #GP during delivery. (this looks like another legit reason to fix exception merging in KVM) Best regards, Maxim Levitsky > > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@xxxxxxxxxx> > --- > arch/x86/kvm/svm/nested.c | 65 +++++++++++++++++++++++++++++++++++++-- > arch/x86/kvm/svm/svm.c | 17 ++++++++-- > arch/x86/kvm/svm/svm.h | 47 ++++++++++++++++++++++++++++ > 3 files changed, 125 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c > index 9656f0d6815c..75017bf77955 100644 > --- a/arch/x86/kvm/svm/nested.c > +++ b/arch/x86/kvm/svm/nested.c > @@ -420,8 +420,17 @@ void nested_copy_vmcb_save_to_cache(struct vcpu_svm *svm, > void nested_sync_control_from_vmcb02(struct vcpu_svm *svm) > { > u32 mask; > - svm->nested.ctl.event_inj = svm->vmcb->control.event_inj; > - svm->nested.ctl.event_inj_err = svm->vmcb->control.event_inj_err; > + > + /* > + * Leave the pending L1 -> L2 event in svm->nested.ctl.event_inj{,err} > + * if its re-injection is needed > + */ > + if (!exit_during_event_injection(svm, svm->nested.ctl.event_inj, > + svm->nested.ctl.event_inj_err)) { > + WARN_ON_ONCE(svm->vmcb->control.event_inj & SVM_EVTINJ_VALID); > + svm->nested.ctl.event_inj = svm->vmcb->control.event_inj; > + svm->nested.ctl.event_inj_err = svm->vmcb->control.event_inj_err; > + } Beware that this could backfire in regard to nested migration. I once chased a nasty bug related to this. The bug was: - L1 does VMRUN with injection (EVENTINJ set) - VMRUN exit handler is 'executed' by KVM, This copies EVENINJ fields from VMCB12 to VMCB02 - Once VMRUN exit handler is done executing, we exit to userspace to start migration (basically static_call(kvm_x86_handle_exit)(...) handles the SVM_EXIT_VMRUN, and that is all, vcpu_enter_guest isn't called again, so injection is not canceled) - migration happens and it migrates the control area of vmcb02 with EVENTINJ fields set. - on migration target, we inject another interrupt to the guest via EVENTINJ because svm_check_nested_events checks nested_run_pending to avoid doing this but nested_run_pending was not migrated correctly, and overwrites the EVENTINJ - injection is lost. Paolo back then proposed to me that instead of doing direct copy from VMCB12 to VMCB02 we should instead go through 'vcpu->arch.interrupt' and such. I had a prototype of this but never gotten to clean it up to be accepted upstream, knowing that current way also works. > > /* Only a few fields of int_ctl are written by the processor. */ > mask = V_IRQ_MASK | V_TPR_MASK; > @@ -669,6 +678,54 @@ static void nested_svm_copy_common_state(struct vmcb *from_vmcb, struct vmcb *to > to_vmcb->save.spec_ctrl = from_vmcb->save.spec_ctrl; > } > > +void nested_svm_maybe_reinject(struct kvm_vcpu *vcpu) A personal taste note: I don't like the 'maybe' for some reason. But I won't fight over this. > +{ > + struct vcpu_svm *svm = to_svm(vcpu); > + unsigned int vector, type; > + u32 exitintinfo = svm->vmcb->control.exit_int_info; > + > + if (WARN_ON_ONCE(!is_guest_mode(vcpu))) > + return; > + > + /* > + * No L1 -> L2 event to re-inject? > + * > + * In this case event_inj will be cleared by > + * nested_sync_control_from_vmcb02(). > + */ > + if (!(svm->nested.ctl.event_inj & SVM_EVTINJ_VALID)) > + return; > + > + /* If the last event injection was successful there shouldn't be any pending event */ > + if (WARN_ON_ONCE(!(exitintinfo & SVM_EXITINTINFO_VALID))) > + return; > + > + kvm_make_request(KVM_REQ_EVENT, vcpu); > + > + vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK; > + type = exitintinfo & SVM_EXITINTINFO_TYPE_MASK; > + > + switch (type) { > + case SVM_EXITINTINFO_TYPE_NMI: > + vcpu->arch.nmi_injected = true; > + break; > + case SVM_EXITINTINFO_TYPE_EXEPT: > + if (exitintinfo & SVM_EXITINTINFO_VALID_ERR) > + kvm_requeue_exception_e(vcpu, vector, > + svm->vmcb->control.exit_int_info_err); > + else > + kvm_requeue_exception(vcpu, vector); > + break; > + case SVM_EXITINTINFO_TYPE_SOFT: Note that AFAIK, SVM_EXITINTINFO_TYPE_SOFT is only for INTn instructions, while INT3 and INTO are considered normal exceptions but EVENTINJ handling has special case for them. On VMX it is a bit cleaner: It has: 3 - normal stock exception caused by CPU itself, like #PF and such 4 - INTn * does DPL check and uses VM_EXIT_INSTRUCTION_LEN 5 - ICEBP/INT1, which SVM doesnt support to re-inject * doesn't do DPL check, but uses VM_EXIT_INSTRUCTION_LEN I think 6 - software exception (INT3/INTO) * does DPL check and uses VM_EXIT_INSTRUCTION_LEN as well I don't know if there is any difference between 4 and 6. > + case SVM_EXITINTINFO_TYPE_INTR: > + kvm_queue_interrupt(vcpu, vector, type == SVM_EXITINTINFO_TYPE_SOFT); > + break; > + default: > + vcpu_unimpl(vcpu, "unknown L1 -> L2 exitintinfo type 0x%x\n", type); > + break; > + } > +} > + > int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, > struct vmcb *vmcb12, bool from_vmrun) > { > @@ -898,6 +955,10 @@ int nested_svm_vmexit(struct vcpu_svm *svm) > if (svm->nrips_enabled) > vmcb12->control.next_rip = vmcb->control.next_rip; > > + /* Forget about any pending L1 event injection since it's a L1 worry now */ > + svm->nested.ctl.event_inj = 0; > + svm->nested.ctl.event_inj_err = 0; > + > vmcb12->control.int_ctl = svm->nested.ctl.int_ctl; > vmcb12->control.tlb_ctl = svm->nested.ctl.tlb_ctl; > vmcb12->control.event_inj = svm->nested.ctl.event_inj; > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c > index 1e5d904aeec3..5b128baa5e57 100644 > --- a/arch/x86/kvm/svm/svm.c > +++ b/arch/x86/kvm/svm/svm.c > @@ -3322,13 +3322,18 @@ static void svm_inject_irq(struct kvm_vcpu *vcpu) > { > struct vcpu_svm *svm = to_svm(vcpu); > > - WARN_ON(!gif_set(svm)); > + WARN_ON(!(vcpu->arch.interrupt.soft || gif_set(svm))); > > trace_kvm_inj_virq(vcpu->arch.interrupt.nr); > ++vcpu->stat.irq_injections; > > svm->vmcb->control.event_inj = vcpu->arch.interrupt.nr | > - SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR; > + SVM_EVTINJ_VALID; > + if (vcpu->arch.interrupt.soft) { > + svm->vmcb->control.event_inj |= SVM_EVTINJ_TYPE_SOFT; > + } else { > + svm->vmcb->control.event_inj |= SVM_EVTINJ_TYPE_INTR; > + } > } > > void svm_complete_interrupt_delivery(struct kvm_vcpu *vcpu, int delivery_mode, > @@ -3627,6 +3632,14 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu) > if (!(exitintinfo & SVM_EXITINTINFO_VALID)) > return; > > + /* L1 -> L2 event re-injection needs a different handling */ > + if (is_guest_mode(vcpu) && > + exit_during_event_injection(svm, svm->nested.ctl.event_inj, > + svm->nested.ctl.event_inj_err)) { > + nested_svm_maybe_reinject(vcpu); > + return; > + } > + > kvm_make_request(KVM_REQ_EVENT, vcpu); > > vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK; > diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h > index f757400fc933..7cafc2e6c82a 100644 > --- a/arch/x86/kvm/svm/svm.h > +++ b/arch/x86/kvm/svm/svm.h > @@ -488,6 +488,52 @@ static inline bool nested_npt_enabled(struct vcpu_svm *svm) > return svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE; > } > > +static inline bool event_inj_same(u32 event_inj1, u32 event_inj_err1, > + u32 event_inj2, u32 event_inj_err2) > +{ > + unsigned int vector_1, vector_2, type_1, type_2; > + > + /* Either of them not valid? */ > + if (!(event_inj1 & SVM_EVTINJ_VALID) || > + !(event_inj2 & SVM_EVTINJ_VALID)) > + return false; > + > + vector_1 = event_inj1 & SVM_EVTINJ_VEC_MASK; > + type_1 = event_inj1 & SVM_EVTINJ_TYPE_MASK; > + vector_2 = event_inj2 & SVM_EVTINJ_VEC_MASK; > + type_2 = event_inj2 & SVM_EVTINJ_TYPE_MASK; > + > + /* Different vector or type? */ > + if (vector_1 != vector_2 || type_1 != type_2) > + return false; > + > + /* Different error code presence flag? */ > + if ((event_inj1 & SVM_EVTINJ_VALID_ERR) != > + (event_inj2 & SVM_EVTINJ_VALID_ERR)) > + return false; > + > + /* No error code? */ > + if (!(event_inj1 & SVM_EVTINJ_VALID_ERR)) > + return true; > + > + /* Same error code? */ > + return event_inj_err1 == event_inj_err2; > +} > + > +/* Did the last VMEXIT happen when attempting to inject that event? */ > +static inline bool exit_during_event_injection(struct vcpu_svm *svm, > + u32 event_inj, u32 event_inj_err) > +{ > + BUILD_BUG_ON(SVM_EXITINTINFO_VEC_MASK != SVM_EVTINJ_VEC_MASK || > + SVM_EXITINTINFO_TYPE_MASK != SVM_EVTINJ_TYPE_MASK || > + SVM_EXITINTINFO_VALID != SVM_EVTINJ_VALID || > + SVM_EXITINTINFO_VALID_ERR != SVM_EVTINJ_VALID_ERR); > + > + return event_inj_same(svm->vmcb->control.exit_int_info, > + svm->vmcb->control.exit_int_info_err, > + event_inj, event_inj_err); > +} > + > /* svm.c */ > #define MSR_INVALID 0xffffffffU > > @@ -540,6 +586,7 @@ static inline bool nested_exit_on_nmi(struct vcpu_svm *svm) > return vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_NMI); > } > > +void nested_svm_maybe_reinject(struct kvm_vcpu *vcpu); > int enter_svm_guest_mode(struct kvm_vcpu *vcpu, > u64 vmcb_gpa, struct vmcb *vmcb12, bool from_vmrun); > void svm_leave_nested(struct kvm_vcpu *vcpu); > I will also review Sean's take on this, let see which one is simplier. Best regards, Maxim Levitsky