Re: [PATCH] KVM: nVMX: Fix loss of pending event before entering L2

Sean Christopherson <sean.j.christopherson@xxxxxxxxx> · Wed, 29 Aug 2018 11:34:52 -0700

On Wed, Aug 29, 2018 at 08:12:43PM +0300, Liran Alon wrote:
> 
> > On 29 Aug 2018, at 19:39, Radim Krčmář <rkrcmar@xxxxxxxxxx> wrote:
> > 
> > 2018-08-29 09:34-0700, Sean Christopherson:
> >> On Wed, Aug 29, 2018 at 06:18:20PM +0200, Radim Krčmář wrote:
> >>> 2018-08-29 18:43+0300, Liran Alon:
> >>>> Consider the case L1 had a pending event until it executed
> >>>> VMLAUNCH/VMRESUME which wasn't delivered because it was disallowed
> >>>> (e.g. interrupts disabled). When L1 executes VMLAUNCH/VMRESUME,
> >>>> L0 needs to evaluate if this pending event should cause an exit from
> >>>> L2 to L1 or delivered directly to L2 (In case L1 don't intercept
> >>>> EXTERNAL_INTERRUPT).
> >>>> 
> >>>> Usually this would be handled by L0 requesting a window (e.g. IRQ
> >>>> window) by setting VMCS accordingly. However, this setting was done on
> >>>> VMCS01 and now VMCS02 is active instead. Thus, when L1 executes
> >>>> VMLAUNCH/VMRESUME we force L0 to perform pending event evaluation by
> >>>> requesting a KVM_REQ_EVENT.
> >>>> 
> >>>> Note that above scenario exists when L1 KVM is about to enter L2 but
> >>>> requests an "immediate-exit". As in this case, L1 will
> >>>> disable-interrupts and then send a self-IPI before entering L2.
> >>> 
> >>> Which makes it a big blunder, I'll add "Cc: stable@xxxxxxxxxxxxxxx".
> >> 
> >> Please hold off on doing anything with this, I don't think this is the
> >> correct fix.  I have a half-finished response to the preemption timer
> >> thread that prompted this patch, I'll get that sent ASAP. 
> > 
> > Sure, thanks for the heads-up.
> 
> Sean, I think this is orthogonal to the “immediate-exit” mechanism implementation issue
> you suggest to replace with preemption-timer with interval of 0 instead of self-IPI.

I agree.  When I said I didn't think it was the correct fix, I was
thinking that we should propagate the pending interrupt from vmcs01
to vmcs02, but I realized that was wrong after analyzing everything
for the thousandth time.

So, I agree that the general direction is correct, though I think we
can narrow down when we force events to be re-evaluated and also be
more explicit in the reasoning.  And this should also override the
HALT_STATE handling, e.g. the injecting to L1 will wake the CPU from
its halt state.  I think the HALT_STATE case was why I saw L2 failing
the preemption unit test.

Hopefully I didn't mangle the patch below...

>From 69617481cb5d8813046d32d2b6881e97b88a746e Mon Sep 17 00:00:00 2001
From: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
Date: Wed, 29 Aug 2018 11:06:14 -0700
Subject: [PATCH] KVM: nVMX: re-evaluate events if L1 should get an INTR/NMI after VMEnter

Force re-evaluation of events prior to running vmcs02 if vmcs01 has
a pending INTR or NMI and vmcs12 is configured to exit on the event,
in which case the event will cause a VMExit to L1 immediately after
VMEnter regardless of L2's event blocking.  Re-evaluating events is
needed to ensure L0 triggers an immediate VMExit from L2 in order to
inject the INTR or NMI into L1.

Cc: stable@xxxxxxxxxxxxxxx
Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
---
 arch/x86/kvm/vmx.c | 27 ++++++++++++++++++++++-----
 1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 8dae47e7267a..a5395fc39cb2 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -12602,7 +12602,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
 	struct vmcs12 *vmcs12;
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 interrupt_shadow = vmx_get_interrupt_shadow(vcpu);
-	u32 exit_qual;
+	u32 exit_qual, vmcs01_cpu_exec_ctrl;
 	int ret;
 
 	if (!nested_vmx_check_permission(vcpu))
@@ -12674,8 +12674,11 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
 
 	/*
 	 * We're finally done with prerequisite checking, and can start with
-	 * the nested entry.
+	 * the nested entry.  Snapshot the CPU-based execution controls from
+	 * vmcs01 before loading vmcs02, we'll need them to properly handle
+	 * post-VMEnter INTR/NMI injection to L1.
 	 */
+	vmcs01_cpu_exec_ctrl = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
 
 	vmx->nested.nested_run_pending = 1;
 	ret = enter_vmx_non_root_mode(vcpu, &exit_qual);
@@ -12701,11 +12704,25 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
 	nested_cache_shadow_vmcs12(vcpu, vmcs12);
 
 	/*
-	 * If we're entering a halted L2 vcpu and the L2 vcpu won't be woken
-	 * by event injection, halt vcpu.
+	 * Force re-evaluation of events prior to running vmcs02 if vmcs01 has
+	 * a pending INTR or NMI and vmcs12 is configured to exit on the event,
+	 * in which case the event will cause a VMExit to L1 immediately after
+	 * VMEnter regardless of L2's event blocking.  Re-evaluating events is
+	 * needed to ensure L0 triggers an immediate VMExit from L2 in order to
+	 * inject the INTR or NMI into L1.
 	 */
-	if ((vmcs12->guest_activity_state == GUEST_ACTIVITY_HLT) &&
+	if (((vmcs01_cpu_exec_ctrl & CPU_BASED_VIRTUAL_INTR_PENDING) &&
+	     (vmcs12->pin_based_vm_exec_control & PIN_BASED_EXT_INTR_MASK)) ||
+	    ((vmcs01_cpu_exec_ctrl & CPU_BASED_VIRTUAL_NMI_PENDING) &&
+	     (vmcs12->pin_based_vm_exec_control & PIN_BASED_NMI_EXITING))) {
+		kvm_make_request(KVM_REQ_EVENT, vcpu);
+	} else if ((vmcs12->guest_activity_state == GUEST_ACTIVITY_HLT) &&
 	    !(vmcs12->vm_entry_intr_info_field & INTR_INFO_VALID_MASK)) {
+		/*
+		 * Halt the vCPU if we're entering a halted L2 vCPU and the L2
+		 * vCPU won't be woken by an injected event, e.g. VOE to L2 or
+		 * INTR/NMI to L1.
+		 */
 		vmx->nested.nested_run_pending = 0;
 		return kvm_vcpu_halt(vcpu);
 	}
-- 
2.18.0

> In my opinion, this patch handles a general issue of losing pending interrupt queued 
> (And disallowed from being dispatched) in L1 before entering L2. This is not just related
> to immediate-exit mechanism. This is also true for example for a timer-interrupt that may be
> raised L1 during the timespan in which L1 disables interrupts until he VMRESUME into L2.
> 
> I have actually written a small effective kvm-unit-test for this. It fails before this patch and passes after it.
> I will submit the unit-test and Cc you guys.
> 
> -Liran
> 
>