On Tue, 2018-03-13 at 22:59 +0100, Radim Krcmár wrote:
> 2018-03-13 13:22-0700, Sean Christopherson:
> > On Tue, 2018-03-13 at 19:42 +0100, Radim Krcmár wrote:
> > > 2018-03-09 23:06+0000, Christopherson, Sean J:
> > > > On Friday, 2018-03-09, Radim Krčmář wrote:
> > > > > 2018-03-05 09:39-0800, Sean Christopherson:
> > > > > > Clear nested_run_pending in handle_invalid_guest_state() after calling
> > > > > > emulate_instruction(), i.e. after attempting to emulate at least one
> > > > > > instruction.  This fixes an issue where L0 enters an infinite loop if
> > > > > > L2 hits an exception that is intercepted by L1 while L0 is emulating
> > > > > > L2's invalid guest state, effectively causing a DoS on L1, e.g. the
> > > > > > only way to break the loop is to kill Qemu in L0.
> > > > > >
> > > > > > 1. call handle_invalid_guest_state() for L2
> > > > > > 2. emulate_instruction() pends an exception, e.g. #UD
> > > > > > 3. L1 intercepts the exception, i.e. nested_vmx_check_exception
> > > > > >    returns 1
> > > > > > 4. vmx_check_nested_events() returns -EBUSY because L1 wants to
> > > > > >    intercept the exception and nested_run_pending is true
> > > > > > 5. handle_invalid_guest_state() never makes forward progress for
> > > > > >    L2 due to the pending exception
> > > > > > 6. L1 retries VMLAUNCH and VMExits to L0 indefinitely, i.e. the
> > > > > >    L1 vCPU trying VMLAUNCH effectively hangs
> > > > > >
> > > > > > Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> > > > > > ---
> > > > > nested_run_pending signals that we have to execute VMRESUME in order to
> > > > > do injection from L2's VMCS (at least VM_ENTRY_INTR_INFO_FIELD).
> > > > >
> > > > > If we don't let the hardware do it, we need to transfer the state from
> > > > > L2's VMCS while doing a nested VM exit for the exception (= behave as if
> > > > > we entered the guest and exited).
> > > > Right, but by virtue of being in handle_invalid_guest_state() we're
> > > > already effectively post-VMRESUME.  handle_invalid_guest_state() is
> > > > technically called in the VMEXIT flow, but the initial check against
> > > > emulation_required is done in vmx_vcpu_run().  Conceptually, emulating
> > > > due to invalid state is distinctly different from emulating due to a
> > > > specific VMEXIT.  In the latter case, we're emulating an instruction
> > > > that has already been executed/attempted in the guest, whereas in the
> > > > invalid guest case we're emulating instructions that haven't been seen
> > > > before.
> > > On a higher level of abstraction, KVM is emulating the CPU just the same.
> > > Here, we're starting by emulating L1's VMLAUNCH.  KVM cannot accelerate
> > > the emulation of the VMLAUNCH instruction nor instructions that come
> > > after VMLAUNCH (L2) due to missing hardware features.
> > >
> > > The purpose of nested_run_pending is to finish the interrupt injection
> > > of VMLAUNCH, which should be done before any L2 instruction is executed.
> > > The normal flow uses (L0's) VMRESUME acceleration for that, but that is
> > > not available in our case.
> > > (Maybe it's possible to accelerate by manipulating the VMCS into a valid
> > > state and forcing an immediate VM exit.)
> > There is nothing to inject at the time of VMLAUNCH.  L1's VMLAUNCH has
> > completed
> This seems to be the core of disagreement.
> The emulated VMLAUNCH depends on VMX to inject whatever is in
> VM_ENTRY_INTR_INFO_FIELD, and if we never do a VMRESUME in L0, then the
> VMLAUNCH is incomplete.

Ah, I see where you're coming from.  Digging a little deeper, I think
the underlying issue is that handle_invalid_guest_state() doesn't
properly handle exceptions that are pended by emulate_instruction().
Regardless of whether we're emulating L1 or L2, we'll enter an infinite
loop if emulation_required is true and an exception is pending.  As you
said, we rely on VMRESUME to inject the exception, but we'll never
execute VMRESUME if emulation_required is true (and it would fail due
to invalid guest state if we tried to force the injection).  I verified
this by generating a #GP while emulating Big RM for L1; KVM soft hangs
and never returns to userspace.

I think the fix would be to mirror what handle_emulation_failure() does
when it gets a #UD at CPL0, i.e. return KVM_INTERNAL_ERROR_EMULATION to
userspace if an exception is pending and emulation_required is true.
Note that a #UD in L1 will already do this because of the
aforementioned handle_emulation_failure().
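Roughly the following, inside the emulation loop in
handle_invalid_guest_state() (a completely untested sketch to
illustrate the idea; the exact condition likely needs refinement,
e.g. to avoid triggering in cases where the pending exception can
actually be injected):

	err = emulate_instruction(vcpu, 0);

	/* ... existing EMULATE_USER_EXIT and failure handling ... */

	/*
	 * Exit to userspace if emulation is still required and an
	 * exception is pending; we can neither emulate past the
	 * exception nor execute VMRESUME to inject it, so staying
	 * in the emulation loop would hang the vCPU forever.
	 */
	if (vmx->emulation_required && vcpu->arch.exception.pending) {
		vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
		vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_EMULATION;
		vcpu->run->internal.ndata = 0;
		return 0;
	}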
> Maybe I'm missing the point of nested_run_pending -- why do we have it
> at all if not to signal that there is a pending injection from vmcs12?

My understanding is that nested_run_pending is quite literal: there is
a nested VM that needs to be run, do that next.  I don't think it's
directly related to a pending injection, but rather is queried to know
what is or isn't legal, e.g. in this case injecting a VMEXIT into L1
isn't legal because we need to run L2 next.

> > we're running/emulating L2 with nothing pending when a
> > #UD is encountered in L2.  In the unrestricted guest case, where we
> > accelerated L1's VMLAUNCH, the vcpu_run postamble in the accelerated
> > VMLAUNCH->VMEXIT path clears nested_run_pending after the VMEXIT(#UD).
> > In the restricted guest case, we bailed out of vcpu_run very early
> > without clearing nested_run_pending.
> Yes, we do not need to inject in this case (which we could detect and
> just never set nested_run_pending in the first place), but we need to
> in the case where L1 sets a pending interrupt in the VMCS.
>
> > Here's a comparison of how a #UD in L2 with invalid guest state would
> > be handled with unrestricted guest enabled vs. disabled, and why I
> > thought it made sense to clear nested_run_pending in
> > handle_invalid_guest_state().
> >
> > Unrestricted Guest
> >
> > H/W   Emu   Action                 Result
> > --------------------------------------
> > L0    *     vcpu_run vmcs01        switch to L1
> > L1    *     vcpu_run vmcs12        exit to L0
> > L0    *     nested_vmx_run         vmx->nested.nested_run_pending = 1
> > L0    *     vcpu_run vmcs02        switch to L2
> The hardware injected an interrupt from vmcs12.
>
> > L2    *     L2 does stuff...
> > L2    *     #UD                    exit to L0
> > L0    *     vcpu_run postamble     vmx->nested.nested_run_pending = 0
> > L0    *     inject VMEXIT(#UD)     succeeds
> > L0    *     vcpu_run vmcs01        switch to L1
> >
> > Restricted Guest
> >
> > H/W   Emu   Action                 Result
> > --------------------------------------
> > L0    *     vcpu_run vmcs01        switch to L1
> > L1    *     vcpu_run vmcs12        exit to L0
> > L0    *     nested_vmx_run         vmx->nested.nested_run_pending = 1
> > L0    *     vcpu_run vmcs02        bails due to emulation_required
> <<Here>> is no code to inject an interrupt from vmcs12.
>
> > L0    L2    emulate_instruction    emulate L2, i.e. post-VMLAUNCH vmcs12
> > L0    L2    L2 does stuff...
> > L0    L2    #UD                    exit to L0 (emulated)
> >
> > <this is where I was suggesting we clear nested_run_pending>
> But why not earlier, e.g. before emulating instructions in the invalid
> state?  I don't understand what nested_run_pending is supposed to do in
> this case.  I think the right place to clear nested_run_pending is at
> <<Here>>, where we also do what hardware with unrestricted guest does.

In an alternate universe where we weren't fixing the underlying bugs,
I would have no qualms about clearing nested_run_pending earlier.  I
put it after emulate_instruction() because I was trying to be
conservative, i.e. I wanted to put it in a location where we could say
without a doubt that we "ran" L2.

> Thanks.
>
> > L0    *     inject VMEXIT(#UD)     fails, nested_run_pending is set
> > L0    L2    infinite loop...
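For reference, "after calling emulate_instruction()" from the patch
amounts to roughly the following in handle_invalid_guest_state()'s
emulation loop (an illustrative sketch of the placement, not the
literal diff):

	err = emulate_instruction(vcpu, 0);

	/*
	 * At this point at least one instruction has been emulated
	 * (or an attempt was made), i.e. we have unquestionably
	 * "run" L2, so injecting a VMEXIT into L1 for a pending
	 * exception is now legal and vmx_check_nested_events() will
	 * no longer return -EBUSY.
	 */
	vmx->nested.nested_run_pending = 0;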