On Wed, Aug 29, 2018 at 11:34:52AM -0700, Sean Christopherson wrote:
> On Wed, Aug 29, 2018 at 08:12:43PM +0300, Liran Alon wrote:
> >
> > > On 29 Aug 2018, at 19:39, Radim Krčmář <rkrcmar@xxxxxxxxxx> wrote:
> > >
> > > 2018-08-29 09:34-0700, Sean Christopherson:
> > >> On Wed, Aug 29, 2018 at 06:18:20PM +0200, Radim Krčmář wrote:
> > >>> 2018-08-29 18:43+0300, Liran Alon:
> > >>>> Consider the case where L1 has a pending event that was not
> > >>>> delivered because delivery was disallowed (e.g. interrupts
> > >>>> disabled) when L1 executed VMLAUNCH/VMRESUME. When L1 executes
> > >>>> VMLAUNCH/VMRESUME, L0 needs to evaluate whether this pending
> > >>>> event should cause an exit from L2 to L1 or be delivered
> > >>>> directly to L2 (in case L1 doesn't intercept EXTERNAL_INTERRUPT).
> > >>>>
> > >>>> Usually this would be handled by L0 requesting a window (e.g. an
> > >>>> IRQ window) by setting the VMCS accordingly. However, that setting
> > >>>> was done on VMCS01, and now VMCS02 is active instead. Thus, when
> > >>>> L1 executes VMLAUNCH/VMRESUME, we force L0 to perform pending
> > >>>> event evaluation by requesting a KVM_REQ_EVENT.
> > >>>>
> > >>>> Note that the above scenario occurs when L1 KVM is about to enter
> > >>>> L2 but requests an "immediate-exit": in that case, L1 disables
> > >>>> interrupts and then sends a self-IPI before entering L2.
> > >>>
> > >>> Which makes it a big blunder, I'll add "Cc: stable@xxxxxxxxxxxxxxx".
> > >>
> > >> Please hold off on doing anything with this, I don't think this is
> > >> the correct fix. I have a half-finished response to the preemption
> > >> timer thread that prompted this patch, I'll get that sent ASAP.
> > >
> > > Sure, thanks for the heads-up.
> >
> > Sean, I think this is orthogonal to the “immediate-exit” mechanism
> > implementation issue, which you suggest replacing with a preemption
> > timer with an interval of 0 instead of a self-IPI.
>
> I agree. When I said I didn't think it was the correct fix, I was
> thinking that we should propagate the pending interrupt from vmcs01
> to vmcs02, but I realized that was wrong after analyzing everything
> for the thousandth time.
>
> So, I agree that the general direction is correct, though I think we
> can narrow down when we force events to be re-evaluated and also be
> more explicit in the reasoning. And this should also override the
> HALT_STATE handling, e.g. injecting to L1 will wake the CPU from its
> halt state. I think the HALT_STATE case was why I saw L2 failing the
> preemption unit test.
>
> Hopefully I didn't mangle the patch below...
>
> From 69617481cb5d8813046d32d2b6881e97b88a746e Mon Sep 17 00:00:00 2001
> From: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> Date: Wed, 29 Aug 2018 11:06:14 -0700
> Subject: [PATCH] KVM: nVMX: re-evaluate events if L1 should get an
>  INTR/NMI after VMEnter
>
> Force re-evaluation of events prior to running vmcs02 if vmcs01 has
> a pending INTR or NMI and vmcs12 is configured to exit on the event,
> in which case the event will cause a VMExit to L1 immediately after
> VMEnter regardless of L2's event blocking. Re-evaluating events is
> needed to ensure L0 triggers an immediate VMExit from L2 in order to
> inject the INTR or NMI into L1.
>
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>

Doh, this obviously should have a Suggested-by or Reported-by for Liran.

> ---
>  arch/x86/kvm/vmx.c | 27 ++++++++++++++++++++++-----
>  1 file changed, 22 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 8dae47e7267a..a5395fc39cb2 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -12602,7 +12602,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
>          struct vmcs12 *vmcs12;
>          struct vcpu_vmx *vmx = to_vmx(vcpu);
>          u32 interrupt_shadow = vmx_get_interrupt_shadow(vcpu);
> -        u32 exit_qual;
> +        u32 exit_qual, vmcs01_cpu_exec_ctrl;
>          int ret;
>
>          if (!nested_vmx_check_permission(vcpu))
> @@ -12674,8 +12674,11 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
>
>          /*
>           * We're finally done with prerequisite checking, and can start with
> -         * the nested entry.
> +         * the nested entry. Snapshot the CPU-based execution controls from
> +         * vmcs01 before loading vmcs02, we'll need them to properly handle
> +         * post-VMEnter INTR/NMI injection to L1.
>           */
> +        vmcs01_cpu_exec_ctrl = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
>
>          vmx->nested.nested_run_pending = 1;
>          ret = enter_vmx_non_root_mode(vcpu, &exit_qual);
> @@ -12701,11 +12704,25 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
>                  nested_cache_shadow_vmcs12(vcpu, vmcs12);
>
>          /*
> -         * If we're entering a halted L2 vcpu and the L2 vcpu won't be woken
> -         * by event injection, halt vcpu.
> +         * Force re-evaluation of events prior to running vmcs02 if vmcs01 has
> +         * a pending INTR or NMI and vmcs12 is configured to exit on the event,
> +         * in which case the event will cause a VMExit to L1 immediately after
> +         * VMEnter regardless of L2's event blocking. Re-evaluating events is
> +         * needed to ensure L0 triggers an immediate VMExit from L2 in order to
> +         * inject the INTR or NMI into L1.
>           */
> -        if ((vmcs12->guest_activity_state == GUEST_ACTIVITY_HLT) &&
> +        if (((vmcs01_cpu_exec_ctrl & CPU_BASED_VIRTUAL_INTR_PENDING) &&
> +             (vmcs12->pin_based_vm_exec_control & PIN_BASED_EXT_INTR_MASK)) ||
> +            ((vmcs01_cpu_exec_ctrl & CPU_BASED_VIRTUAL_NMI_PENDING) &&
> +             (vmcs12->pin_based_vm_exec_control & PIN_BASED_NMI_EXITING))) {
> +                kvm_make_request(KVM_REQ_EVENT, vcpu);
> +        } else if ((vmcs12->guest_activity_state == GUEST_ACTIVITY_HLT) &&
>              !(vmcs12->vm_entry_intr_info_field & INTR_INFO_VALID_MASK)) {
> +                /*
> +                 * Halt the vCPU if we're entering a halted L2 vCPU and the L2
> +                 * vCPU won't be woken by an injected event, e.g. VOE to L2 or
> +                 * INTR/NMI to L1.
> +                 */
>                  vmx->nested.nested_run_pending = 0;
>                  return kvm_vcpu_halt(vcpu);
>          }
> --
> 2.18.0
>
> > In my opinion, this patch handles a general issue of losing a pending
> > interrupt that was queued (and disallowed from being dispatched) in L1
> > before entering L2. This is not just related to the immediate-exit
> > mechanism. It is also true, for example, for a timer interrupt that
> > may be raised in L1 during the timespan in which L1 has interrupts
> > disabled, until it VMRESUMEs into L2.
> >
> > I have actually written a small, effective kvm-unit-test for this. It
> > fails before this patch and passes after it. I will submit the
> > unit-test and Cc you guys.
> >
> > -Liran
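
For context: the "immediate-exit" mechanism referenced in the commit
message is the trick L1's KVM uses to force a VMExit the instant L2 is
entered. At the time of this thread it looked roughly like the sketch
below, paraphrased and simplified from vcpu_enter_guest() in
arch/x86/kvm/x86.c (this is not the verbatim code):

        /*
         * Interrupts are disabled before the final entry checks, so the
         * reschedule IPI we send to our own pCPU stays pending across
         * VMRESUME and triggers an EXTERNAL_INTERRUPT VMExit immediately
         * after the guest (here: L2) is entered.
         */
        local_irq_disable();
        ...
        if (req_immediate_exit)
                smp_send_reschedule(vcpu->cpu);  /* self-IPI */
        ...
        kvm_x86_ops->run(vcpu);                  /* VMRESUME */

Because L1 runs this with interrupts disabled, the self-IPI is precisely
the kind of pending-but-blocked event that the patch above forces L0 to
re-evaluate when L1's VMRESUME traps to L0.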
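
As for the kvm-unit-test Liran mentions, it had not been posted as of
this message. A hypothetical sketch of what such a test could look like
is below, written against kvm-unit-tests-style helpers (test_set_guest(),
enter_guest(), vmcs_write()/vmcs_read(), handle_irq(), PIN_CONTROLS/
PIN_EXTINT, EXI_REASON/VMX_EXTINT); the vector choice and exact flow are
illustrative assumptions, and the test actually submitted may differ:

        #define TEST_VECTOR 0xE1                /* hypothetical free vector */

        static volatile bool l2_ran;
        static volatile bool intr_fired;

        static void test_isr(isr_regs_t *regs)
        {
                intr_fired = true;
                eoi();
        }

        static void l2_guest(void)
        {
                /* L2 must not run before the pending INTR exits to L1. */
                l2_ran = true;
        }

        static void vmx_pending_event_test(void)
        {
                handle_irq(TEST_VECTOR, test_isr);

                /* vmcs12: external interrupts exit from L2 to L1. */
                vmcs_write(PIN_CONTROLS, vmcs_read(PIN_CONTROLS) | PIN_EXTINT);
                test_set_guest(l2_guest);

                /*
                 * Queue an interrupt while interrupts are disabled, then
                 * enter L2. L0 must force an immediate EXTERNAL_INTERRUPT
                 * VMExit to L1 instead of letting L2 run with the event
                 * pending -- exactly the case the patch above fixes.
                 */
                cli();
                apic_icr_write(APIC_DEST_SELF | APIC_DEST_PHYSICAL |
                               APIC_DM_FIXED | TEST_VECTOR, 0);
                enter_guest();

                report("pending INTR exited from L2 to L1",
                       vmcs_read(EXI_REASON) == VMX_EXTINT && !l2_ran);

                sti();
                asm volatile ("nop");   /* window for the IRQ after STI */
                report("INTR delivered to L1 after the exit", intr_fired);
        }

This mirrors the failure mode described above: without the fix, the
pending interrupt is never re-evaluated by L0, so L2 runs first and the
first check fails; with the fix, VMEnter is followed immediately by an
exit to L1.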