2018-03-13 13:22-0700, Sean Christopherson:
> On Tue, 2018-03-13 at 19:42 +0100, Radim Krčmář wrote:
> > 2018-03-09 23:06+0000, Christopherson, Sean J:
> > > On Friday, 2018-03-09, Radim Krčmář wrote:
> > > > 2018-03-05 09:39-0800, Sean Christopherson:
> > > > > Clear nested_run_pending in handle_invalid_guest_state() after
> > > > > calling emulate_instruction(), i.e. after attempting to emulate
> > > > > at least one instruction.  This fixes an issue where L0 enters an
> > > > > infinite loop if L2 hits an exception that is intercepted by L1
> > > > > while L0 is emulating L2's invalid guest state, effectively
> > > > > causing a DoS on L1; the only way to break the loop is to kill
> > > > > QEMU in L0.
> > > > >
> > > > > 1. call handle_invalid_guest_state() for L2
> > > > > 2. emulate_instruction() pends an exception, e.g. #UD
> > > > > 3. L1 intercepts the exception, i.e. nested_vmx_check_exception
> > > > >    returns 1
> > > > > 4. vmx_check_nested_events() returns -EBUSY because L1 wants to
> > > > >    intercept the exception and nested_run_pending is true
> > > > > 5. handle_invalid_guest_state() never makes forward progress for
> > > > >    L2 due to the pending exception
> > > > > 6. L1 retries VMLAUNCH and VM-exits to L0 indefinitely, i.e. the
> > > > >    L1 vCPU trying VMLAUNCH effectively hangs
> > > > >
> > > > > Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> > > > > ---
> > > >
> > > > nested_run_pending signals that we have to execute VMRESUME in order
> > > > to do injection from L2's VMCS (at least VM_ENTRY_INTR_INFO_FIELD).
> > > >
> > > > If we don't let the hardware do it, we need to transfer the state
> > > > from L2's VMCS while doing a nested VM exit for the exception
> > > > (= behave as if we entered the guest and exited).
> > >
> > > Right, but by virtue of being in handle_invalid_guest_state() we're
> > > already effectively post-VMRESUME.  handle_invalid_guest_state() is
> > > technically called in the VMEXIT flow, but the initial check against
> > > emulation_required is done in vmx_vcpu_run().  Conceptually, emulating
> > > due to invalid guest state is distinctly different from emulating due
> > > to a specific VMEXIT.  In the latter case, we're emulating an
> > > instruction that has already been executed/attempted in the guest,
> > > whereas in the invalid guest state case we're emulating instructions
> > > that haven't been seen before.
> >
> > On a higher level of abstraction, KVM is emulating the CPU just the
> > same.  Here, we're starting by emulating L1's VMLAUNCH.  KVM cannot
> > accelerate the emulation of the VMLAUNCH instruction, nor of the
> > instructions that come after VMLAUNCH (L2), due to missing hardware
> > features.
> >
> > The purpose of nested_run_pending is to finish the interrupt injection
> > of VMLAUNCH, which should be done before any L2 instruction is
> > executed.  The normal flow uses (L0's) VMRESUME acceleration for that,
> > but that is not available in our case.
> > (Maybe it's possible to accelerate by manipulating the VMCS into a
> > valid state and forcing an immediate VM exit.)
>
> There is nothing to inject at the time of VMLAUNCH.  L1's VMLAUNCH has
> completed

This seems to be the core of the disagreement.  The emulated VMLAUNCH
depends on VMX to inject whatever is in VM_ENTRY_INTR_INFO_FIELD, and if
we never do a VMRESUME in L0, then the VMLAUNCH is incomplete.

Maybe I'm missing the point of nested_run_pending -- why do we have it
at all if not to signal that there is a pending injection from vmcs12?
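For reference, this is roughly what I understand the patch to do,
reconstructed from the changelog above -- a sketch against the 4.16-era
handle_invalid_guest_state() in arch/x86/kvm/vmx.c, not the actual diff:

static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	enum emulation_result err = EMULATE_DONE;

	while (vmx->emulation_required) {
		/* ... existing pre-emulation checks elided ... */

		err = emulate_instruction(vcpu, 0);

		/*
		 * At least one L2 instruction has now been emulated (or
		 * the attempt pended an exception), so treat the emulated
		 * VMLAUNCH as complete: clear nested_run_pending so that
		 * vmx_check_nested_events() stops returning -EBUSY when
		 * L1 intercepts the pending exception (step 4 above).
		 */
		vmx->nested.nested_run_pending = 0;

		if (err != EMULATE_DONE)
			break;

		/* ... signal/scheduling checks elided ... */
	}

	return 1;	/* simplified; the real function reports failures */
}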
> we're running/emulating L2 with nothing pending when a #UD is
> encountered in L2.  In the unrestricted guest case, where we
> accelerated L1's VMLAUNCH, the vcpu_run postamble in the accelerated
> VMLAUNCH->VMEXIT path clears nested_run_pending after the VMEXIT(#UD).
> In the restricted guest case, we bailed out of vcpu_run very early
> without clearing nested_run_pending.

Yes, we do not need to inject in this case (which we could detect and
just never set nested_run_pending in the first place), but we do need
to in the case where L1 sets a pending interrupt in the VMCS.

> Here's a comparison of how a #UD in L2 with invalid guest state would
> be handled with unrestricted guest enabled vs. disabled, and why I
> thought it made sense to clear nested_run_pending in
> handle_invalid_guest_state().
>
>
> Unrestricted Guest
>
>  H/W  Emu  Action                Result
>  ------------------------------------------------------------------
>  L0   *    vcpu_run vmcs01       switch to L1
>  L1   *    vcpu_run vmcs12       exit to L0
>  L0   *    nested_vmx_run        vmx->nested.nested_run_pending = 1
>  L0   *    vcpu_run vmcs02       switch to L2

The hardware injected an interrupt from vmcs12 at this point.

>  L2   *    L2 does stuff...
>  L2   *    #UD                   exit to L0
>  L0   *    vcpu_run postamble    vmx->nested.nested_run_pending = 0
>  L0   *    inject VMEXIT(#UD)    succeeds
>  L0   *    vcpu_run vmcs01       switch to L1
>
>
> Restricted Guest
>
>  H/W  Emu  Action                Result
>  ------------------------------------------------------------------
>  L0   *    vcpu_run vmcs01       switch to L1
>  L1   *    vcpu_run vmcs12       exit to L0
>  L0   *    nested_vmx_run        vmx->nested.nested_run_pending = 1
>  L0   *    vcpu_run vmcs02       bails due to emulation_required

At <<Here>>, there is no code to inject an interrupt from vmcs12.

>  L0   L2   emulate_instruction   emulate L2, i.e. post-VMLAUNCH vmcs12
>  L0   L2   L2 does stuff...
>  L0   L2   #UD                   exit to L0 (emulated)
>
> <this is where I was suggesting we clear nested_run_pending>

But why not earlier, e.g. before emulating instructions in the invalid
state?  I don't understand what nested_run_pending is supposed to do in
this case.

I think the right place to clear nested_run_pending is at <<Here>>,
where we would also do what hardware with unrestricted guest does.

Thanks.

>  L0   *    inject VMEXIT(#UD)    fails, nested_run_pending is set
>  L0   L2   infinite loop...
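Concretely, something along these lines at <<Here>> is what I have in
mind -- a rough sketch only, against the 4.16-era vmx_vcpu_run();
nested_inject_vmentry_event() is a made-up name for injecting the
vmcs12 VM_ENTRY_INTR_INFO_FIELD event by software:

static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);

	/* Don't enter VMX if guest state is invalid, let the exit
	   handler start emulation until we arrive back to a valid
	   state.  */
	if (vmx->emulation_required) {
		/*
		 * <<Here>>: the VMRESUME that would normally finish the
		 * event injection of the emulated VMLAUNCH will never
		 * happen, so do what the hardware would have done --
		 * inject the vmcs12 event by software -- and only then
		 * drop nested_run_pending, before any L2 instruction is
		 * emulated.
		 */
		if (vmx->nested.nested_run_pending) {
			nested_inject_vmentry_event(vcpu); /* hypothetical */
			vmx->nested.nested_run_pending = 0;
		}
		return;
	}

	/* ... the usual VMLAUNCH/VMRESUME fast path elided ... */
}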