On Friday, 2018-03-09, Radim Krčmář wrote:
> 2018-03-05 09:39-0800, Sean Christopherson:
> > Clear nested_run_pending in handle_invalid_guest_state() after calling
> > emulate_instruction(), i.e. after attempting to emulate at least one
> > instruction.  This fixes an issue where L0 enters an infinite loop if
> > L2 hits an exception that is intercepted by L1 while L0 is emulating
> > L2's invalid guest state, effectively causing DoS on L1, e.g. the only
> > way to break the loop is to kill Qemu in L0.
> >
> >   1. call handle_invalid_guest_state() for L2
> >   2. emulate_instruction() pends an exception, e.g. #UD
> >   3. L1 intercepts the exception, i.e. nested_vmx_check_exception
> >      returns 1
> >   4. vmx_check_nested_events() returns -EBUSY because L1 wants to
> >      intercept the exception and nested_run_pending is true
> >   5. handle_invalid_guest_state() never makes forward progress for
> >      L2 due to the pending exception
> >   6. L1 retries VMLAUNCH and VMExits to L0 indefinitely, i.e. the
> >      L1 vCPU trying VMLAUNCH effectively hangs
> >
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> > ---
>
> nested_run_pending signals that we have to execute VMRESUME in order to
> do injection from L2's VMCS (at least VM_ENTRY_INTR_INFO_FIELD).
>
> If we don't let the hardware do it, we need to transfer the state from
> L2's VMCS while doing a nested VM exit for the exception (= behave as if
> we entered the guest and exited).

Right, but by virtue of being in handle_invalid_guest_state() we're
already effectively post-VMRESUME.  handle_invalid_guest_state() is
technically called in the VMEXIT flow, but the initial check against
emulation_required is done in vmx_vcpu_run().

Conceptually, emulating due to invalid guest state is distinctly
different from emulating due to a specific VMEXIT.  In the latter case
we're emulating an instruction that has already been executed/attempted
in the guest, whereas in the invalid guest state case we're emulating
instructions the guest hasn't seen before.  So in that sense, clearing
nested_run_pending when emulating invalid guest state is analogous to
clearing nested_run_pending after VMENTER in vmx_vcpu_run().

> And I think the actual fix here is to evaluate the interrupt before the
> first emulate_instruction() in handle_invalid_guest_state().

The issue isn't that we don't evaluate an interrupt/exception, it's
that nested_run_pending is never cleared, and so inject_pending_event()
always returns -EBUSY without injecting the VMEXIT(#UD) to L1 (the
check in question is sketched at the end of this mail).  And in the
failing case there is no interrupt to evaluate before the first
emulate_instruction(); the #UD is pended by emulate_instruction()
itself.  Theoretically, any number of exceptions that L1 wants to
intercept could be detected while emulating L2.

All that being said, I only encountered this bug due to the vmcs02 sync
bug (https://patchwork.kernel.org/patch/10259389/), i.e. L0 incorrectly
thought L1 was attempting to run L2 with invalid guest state in vmcs12.

Thinking about it more, L0 can only reach handle_invalid_guest_state()
in the context of L2 if L0 has a bug, or if L1 attempts to run L2 with
invalid guest state, e.g. L1 knowingly ignores the fact that
unrestricted guest is disabled, or L1 has a bug of its own.

So I think the correct fix would be to return 1 from prepare_vmcs02()
if emulation_required is true, i.e. signal that VMEntry failed due to
EXIT_REASON_INVALID_STATE, and to add a BUG in either vmx_vcpu_run() or
handle_invalid_guest_state() that fires if we're emulating L2, i.e.

	BUG_ON(vmx->emulation_required && vmx->nested.nested_run_pending);
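
Something along the lines of the below, though it's a completely
untested sketch: I'm assuming the existing *entry_failure_code plumbing
in prepare_vmcs02() can be reused, that ENTRY_FAIL_DEFAULT is an
acceptable exit qualification for this case, and that emulation_required
has already been recalculated by the time the check runs:

	/*
	 * Sketch only: at the tail of prepare_vmcs02(), after the vmcs12
	 * guest state has been loaded into vmcs02.  Fail the VMEntry
	 * instead of limping along with emulation_required set for L2.
	 */
	if (vmx->emulation_required) {
		*entry_failure_code = ENTRY_FAIL_DEFAULT;
		return 1;
	}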
> Do you want to look deeper into this?
>
> Thanks.
>
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 591214843046..3073160e6bae 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -6835,6 +6835,8 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
> >  
> >  		err = emulate_instruction(vcpu, 0);
> >  
> > +		vmx->nested.nested_run_pending = 0;
> > +
> >  		if (err == EMULATE_USER_EXIT) {
> >  			++vcpu->stat.mmio_exits;
> >  			ret = 0;
> > --
> > 2.16.2
> >
>
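
For reference, the vmx_check_nested_events() check that produces the
-EBUSY described above, paraphrased from memory and heavily simplified,
so the exact code and surrounding conditions may differ:

	/*
	 * While nested_run_pending is set, a pending exception that L1
	 * wants to intercept can never be converted into a nested VMEXIT,
	 * so handle_invalid_guest_state() never makes forward progress.
	 */
	if (vcpu->arch.exception.pending &&
	    nested_vmx_check_exception(vcpu, &exit_qual)) {
		if (vmx->nested.nested_run_pending)
			return -EBUSY;
		nested_vmx_inject_exception_vmexit(vcpu, exit_qual);
		return 0;
	}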