On Tue, 2018-03-13 at 19:42 +0100, Radim Krcmár wrote:
> 2018-03-09 23:06+0000, Christopherson, Sean J:
> > On Friday, 2018-03-09, Radim Krčmář wrote:
> > > 2018-03-05 09:39-0800, Sean Christopherson:
> > > > Clear nested_run_pending in handle_invalid_guest_state() after
> > > > calling emulate_instruction(), i.e. after attempting to emulate at
> > > > least one instruction.  This fixes an issue where L0 enters an
> > > > infinite loop if L2 hits an exception that is intercepted by L1
> > > > while L0 is emulating L2's invalid guest state, effectively causing
> > > > a DoS on L1, e.g. the only way to break the loop is to kill Qemu
> > > > in L0.
> > > >
> > > >   1. call handle_invalid_guest_state() for L2
> > > >   2. emulate_instruction() pends an exception, e.g. #UD
> > > >   3. L1 intercepts the exception, i.e. nested_vmx_check_exception
> > > >      returns 1
> > > >   4. vmx_check_nested_events() returns -EBUSY because L1 wants to
> > > >      intercept the exception and nested_run_pending is true
> > > >   5. handle_invalid_guest_state() never makes forward progress for
> > > >      L2 due to the pending exception
> > > >   6. L1 retries VMLAUNCH and VMExits to L0 indefinitely, i.e. the
> > > >      L1 vCPU trying VMLAUNCH effectively hangs
> > > >
> > > > Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> > > > ---
> > >
> > > nested_run_pending signals that we have to execute VMRESUME in order
> > > to do injection from L2's VMCS (at least VM_ENTRY_INTR_INFO_FIELD).
> > >
> > > If we don't let the hardware do it, we need to transfer the state
> > > from L2's VMCS while doing a nested VM exit for the exception
> > > (= behave as if we entered the guest and exited).
> >
> > Right, but by virtue of being in handle_invalid_guest_state() we're
> > already effectively post-VMRESUME.  handle_invalid_guest_state() is
> > technically called in the VMEXIT flow, but the initial check against
> > emulation_required is done in vmx_vcpu_run().  Conceptually, emulating
> > due to invalid state is distinctly different from emulating due to a
> > specific VMEXIT.  In the latter case, we're emulating an instruction
> > that has already been executed/attempted in the guest, whereas in the
> > invalid guest state case we're emulating instructions that haven't
> > been seen before.
>
> On a higher level of abstraction, KVM is emulating a CPU just the same.
> Here, we're starting by emulating L1's VMLAUNCH.  KVM cannot accelerate
> the emulation of the VMLAUNCH instruction, nor of the instructions that
> come after VMLAUNCH (L2), due to missing hardware features.
>
> The purpose of nested_run_pending is to finish the interrupt injection
> of VMLAUNCH, which should be done before any L2 instruction is
> executed.  The normal flow uses (L0's) VMRESUME acceleration for that,
> but that is not available in our case.
> (Maybe it's possible to accelerate by manipulating the VMCS into a
> valid state and forcing an immediate VM exit.)

There is nothing to inject at the time of VMLAUNCH.  L1's VMLAUNCH has
completed, and we're running/emulating L2 with nothing pending when a
#UD is encountered in L2.  In the unrestricted guest case, where we
accelerated L1's VMLAUNCH, the vcpu_run postamble in the accelerated
VMLAUNCH->VMEXIT path clears nested_run_pending after the VMEXIT(#UD).
In the restricted guest case, we bailed out of vcpu_run very early,
without clearing nested_run_pending.
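
To illustrate the asymmetry, vmx_vcpu_run() looks roughly like this
(paraphrased, not the verbatim source):

	static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
	{
		struct vcpu_vmx *vmx = to_vmx(vcpu);

		/*
		 * Don't enter VMX if guest state is invalid, let the
		 * exit handler start emulating.  Note that we bail with
		 * nested_run_pending still set...
		 */
		if (vmx->emulation_required)
			return;

		/* ... inline asm that does VMLAUNCH/VMRESUME ... */

		/*
		 * ...whereas the postamble, which is reached only after
		 * an actual hardware VMENTER/VMEXIT, clears it.
		 */
		vmx->nested.nested_run_pending = 0;
	}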

Here's a comparison of how a #UD in L2 with invalid guest state would be
handled with unrestricted guest enabled vs. disabled, and why I thought
it made sense to clear nested_run_pending in
handle_invalid_guest_state().

Unrestricted Guest

H/W  Emu  Action               Result
--------------------------------------
L0   *    vcpu_run vmcs01      switch to L1
L1   *    vcpu_run vmcs12      exit to L0
L0   *    nested_vmx_run       vmx->nested.nested_run_pending = 1
L0   *    vcpu_run vmcs02      switch to L2
L2   *    L2 does stuff...
L2   *    #UD                  exit to L0
L0   *    vcpu_run postamble   vmx->nested.nested_run_pending = 0
L0   *    inject VMEXIT(#UD)   succeeds
L0   *    vcpu_run vmcs01      switch to L1

Restricted Guest

H/W  Emu  Action               Result
--------------------------------------
L0   *    vcpu_run vmcs01      switch to L1
L1   *    vcpu_run vmcs12      exit to L0
L0   *    nested_vmx_run       vmx->nested.nested_run_pending = 1
L0   *    vcpu_run vmcs02      bails due to emulation_required
L0   L2   emulate_instruction  emulate L2, i.e. post-VMLAUNCH vmcs12
L0   L2   L2 does stuff...
L0   L2   #UD                  exit to L0 (emulated)

          <this is where I was suggesting we clear nested_run_pending>

L0   *    inject VMEXIT(#UD)   fails, nested_run_pending is set
L0   L2   infinite loop...

> > So in that sense, clearing nested_run_pending when emulating invalid
> > guest state is analogous to clearing nested_run_pending after VMENTER
> > in vmx_vcpu_run().
>
> Only if we actually do what VMLAUNCH does.  We should finish VMLAUNCH
> by emulating the injection before emulating invalid L2 state.
>
> > > And I think the actual fix here is to evaluate the interrupt before
> > > the first emulate_instruction() in handle_invalid_guest_state().
> >
> > The issue isn't that we don't evaluate an interrupt/exception, it's
> > that nested_run_pending is never cleared, and so
> > inject_pending_event() always returns -EBUSY without injecting the
> > VMEXIT(#UD) to L1.
>
> nested_run_pending is never cleared because we accelerate injection of
> the interrupt in hardware and therefore need a (L0) VMRESUME.
>
> I was proposing to emulate the whole VMLAUNCH before emulating L2
> guest state, which would prevent the -EBUSY as a side effect.
>
> > And in the failing case, there is no interrupt to evaluate before the
> > first emulate_instruction(); the #UD is pended by
> > emulate_instruction().  Theoretically, any number of exceptions that
> > L1 wants to intercept could be detected while emulating L2.
>
> A simple fix for this particular case would be to skip setting
> nested_run_pending, because we don't defer injection to (L0) VMRESUME.
>
> > All that being said, I only encountered this bug due to the vmcs02
> > sync bug (https://patchwork.kernel.org/patch/10259389/), e.g. L0
> > incorrectly thought L1 was attempting to run L2 with invalid guest
> > state in vmcs12.  Thinking about it more, L0 can only reach
> > handle_invalid_guest_state() in the context of L2 if L0 has a bug,
> > or if L1 attempts to run L2 with invalid state, e.g. if L1 knowingly
> > ignores the fact that unrestricted guest is disabled, or has a bug
> > of its own.
>
> Right, it shouldn't be happening on recent hardware.
>
> > So, I think the correct fix would be to return 1 from prepare_vmcs02
> > if emulation_required is true, i.e. signal that VMEntry failed due to
> > EXIT_REASON_INVALID_STATE, and add a BUG either in vmx_vcpu_run() or
> > handle_invalid_guest_state() that fires if we're emulating L2, i.e.
> > BUG_ON(vmx->emulation_required && vmx->nested.nested_run_pending);
>
> Sounds good (with a WARN_ONCE like Paolo said, because a guest can
> configure it and KVM should support it),
>
> thanks.
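
Something like the following is what I have in mind, i.e. fail the
nested VMEntry instead of trying to emulate L2 (rough, untested sketch;
the WARN message and the exact plumbing of the failure back to the
VMEntry path are placeholders):

	/* In prepare_vmcs02(), after loading vmcs12's guest state: */
	if (vmx->emulation_required) {
		/*
		 * L1 is attempting VMEntry with invalid guest state (or
		 * we have a vmcs02 sync bug); signal VMEntry failure due
		 * to EXIT_REASON_INVALID_STATE instead of attempting to
		 * emulate L2.
		 */
		return 1;
	}

	/*
	 * And in handle_invalid_guest_state(), a WARN_ONCE rather than
	 * a BUG_ON, per Paolo's suggestion:
	 */
	WARN_ONCE(vmx->emulation_required && vmx->nested.nested_run_pending,
		  "KVM: emulating L2 with invalid guest state\n");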