On Thu, 2018-08-30 at 03:24 +0300, Liran Alon wrote:
> 
> 
> > On 29 Aug 2018, at 19:34, Sean Christopherson <sean.j.christopherson@xxxxxxxxx> wrote:
> > 
> > I agree.  When I said I didn't think it was the correct fix, I was
> > thinking that we should propagate the pending interrupt from vmcs01
> > to vmcs02, but I realized that was wrong after analyzing everything
> > for the thousandth time.
> > 
> > So, I agree that the general direction is correct, though I think we
> > can narrow down when we force events to be re-evaluated and also be
> > more explicit in the reasoning.  And this should also override the
> > HALT_STATE handling, e.g. the injection to L1 will wake the CPU from
> > its halt state.  I think the HALT_STATE case was why I saw L2 failing
> > the preemption unit test.
> > 
> > Hopefully I didn't mangle the patch below...
> > 
> > From 69617481cb5d8813046d32d2b6881e97b88a746e Mon Sep 17 00:00:00 2001
> > From: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> > Date: Wed, 29 Aug 2018 11:06:14 -0700
> > Subject: [PATCH] KVM: nVMX: re-evaluate events if L1 should get an INTR/NMI after VMEnter
> > 
> > Force re-evaluation of events prior to running vmcs02 if vmcs01 has
> > a pending INTR or NMI and vmcs12 is configured to exit on the event,
> > in which case the event will cause a VMExit to L1 immediately after
> > VMEnter regardless of L2's event blocking.  Re-evaluating events is
> > needed to ensure L0 triggers an immediate VMExit from L2 in order to
> > inject the INTR or NMI into L1.
> > 
> > Cc: stable@xxxxxxxxxxxxxxx
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> > ---
> >  arch/x86/kvm/vmx.c | 27 ++++++++++++++++++++++-----
> >  1 file changed, 22 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 8dae47e7267a..a5395fc39cb2 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -12602,7 +12602,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
> >  	struct vmcs12 *vmcs12;
> >  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >  	u32 interrupt_shadow = vmx_get_interrupt_shadow(vcpu);
> > -	u32 exit_qual;
> > +	u32 exit_qual, vmcs01_cpu_exec_ctrl;
> >  	int ret;
> > 
> >  	if (!nested_vmx_check_permission(vcpu))
> > @@ -12674,8 +12674,11 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
> > 
> >  	/*
> >  	 * We're finally done with prerequisite checking, and can start with
> > -	 * the nested entry.
> > +	 * the nested entry.  Snapshot the CPU-based execution controls from
> > +	 * vmcs01 before loading vmcs02, we'll need them to properly handle
> > +	 * post-VMEnter INTR/NMI injection to L1.
> >  	 */
> > +	vmcs01_cpu_exec_ctrl = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
> > 
> >  	vmx->nested.nested_run_pending = 1;
> >  	ret = enter_vmx_non_root_mode(vcpu, &exit_qual);
> > @@ -12701,11 +12704,25 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
> >  		nested_cache_shadow_vmcs12(vcpu, vmcs12);
> > 
> >  	/*
> > -	 * If we're entering a halted L2 vcpu and the L2 vcpu won't be woken
> > -	 * by event injection, halt vcpu.
> > +	 * Force re-evaluation of events prior to running vmcs02 if vmcs01 has
> > +	 * a pending INTR or NMI and vmcs12 is configured to exit on the event,
> > +	 * in which case the event will cause a VMExit to L1 immediately after
> > +	 * VMEnter regardless of L2's event blocking.  Re-evaluating events is
> > +	 * needed to ensure L0 triggers an immediate VMExit from L2 in order to
> > +	 * inject the INTR or NMI into L1.
> >  	 */
> > -	if ((vmcs12->guest_activity_state == GUEST_ACTIVITY_HLT) &&
> > +	if (((vmcs01_cpu_exec_ctrl & CPU_BASED_VIRTUAL_INTR_PENDING) &&
> > +	     (vmcs12->pin_based_vm_exec_control & PIN_BASED_EXT_INTR_MASK)) ||
> > +	    ((vmcs01_cpu_exec_ctrl & CPU_BASED_VIRTUAL_NMI_PENDING) &&
> > +	     (vmcs12->pin_based_vm_exec_control & PIN_BASED_NMI_EXITING))) {
> > +		kvm_make_request(KVM_REQ_EVENT, vcpu);
> > +	} else if ((vmcs12->guest_activity_state == GUEST_ACTIVITY_HLT) &&
> >  	    !(vmcs12->vm_entry_intr_info_field & INTR_INFO_VALID_MASK)) {
> > +		/*
> > +		 * Halt the vCPU if we're entering a halted L2 vCPU and the L2
> > +		 * vCPU won't be woken by an injected event, e.g. VOE to L2 or
> > +		 * INTR/NMI to L1.
> > +		 */
> >  		vmx->nested.nested_run_pending = 0;
> >  		return kvm_vcpu_halt(vcpu);
> >  	}
> 
> (Sorry for opening a new email thread. Something screwed up in my email client…)
> 
> I see a couple of issues with the patch you suggest:
> 
> 1. When L1 doesn't intercept external interrupts, the interrupt should still be injected after vmlaunch/vmresume, but directly to L2.
> Therefore, we should remove the tests against the vmcs12->pin_based_vm_exec_control bits.
> (KVM_REQ_EVENT will handle this direct injection to L2 correctly.)

Right, not sure why I brought vmcs12 into the picture.  The behavior
we're enforcing is the VMExit to L0; everything after that is handled
by whatever L0 normally does after the associated VMExit.

> 2. The logic for deciding if we should set KVM_REQ_EVENT should be inside enter_vmx_non_root_mode() and not nested_vmx_run().
> This is because otherwise, you could reach a buggy scenario after nVMX migration, which sets nVMX state directly from vmx_set_nested_state().

In that case I think we'd want to check nested_run_pending, i.e. entry
from SMM shouldn't set KVM_REQ_EVENT.  If nested_run_pending==false and
one of the pending bits is set (in the migration case) we already goofed.

> As I agree with you regarding your other comments, I will create a v2 of my patch and submit it.
:)

> (Together with the relevant kvm-unit-test)
> 
> Thanks,
> -Liran