Re: [PATCH] KVM: nVMX: clear nested_run_pending when emulating invalid guest state

Radim Krcmár <rkrcmar@xxxxxxxxxx> · Tue, 13 Mar 2018 19:42:44 +0100

2018-03-09 23:06+0000, Christopherson, Sean J:
> On Friday, 2018-03-09, Radim Krčmář wrote:
> > 2018-03-05 09:39-0800, Sean Christopherson:
> > > Clear nested_run_pending in handle_invalid_guest_state() after calling
> > > emulate_instruction(), i.e. after attempting to emulate at least one
> > > instruction.  This fixes an issue where L0 enters an infinite loop if
> > > L2 hits an exception that is intercepted by L1 while L0 is emulating
> > > L2's invalid guest state, effectively causing DoS on L1, e.g. the only
> > > way to break the loop is to kill Qemu in L0.
> > > 
> > >     1. call handle_invalid_guest_state() for L2
> > >     2. emulate_instruction() pends an exception, e.g. #UD
> > >     3. L1 intercepts the exception, i.e. nested_vmx_check_exception
> > >        returns 1
> > >     4. vmx_check_nested_events() returns -EBUSY because L1 wants to
> > >        intercept the exception and nested_run_pending is true
> > >     5. handle_invalid_guest_state() never makes forward progress for
> > >        L2 due to the pending exception
> > >     6. L1 retries VMLAUNCH and VMExits to L0 indefinitely, i.e. the
> > >        L1 vCPU trying VMLAUNCH effectively hangs
> > > 
> > > Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> > > ---
> > 
> > nested_run_pending signals that we have to execute VMRESUME in order to
> > do injection from L2's VMCS (at least VM_ENTRY_INTR_INFO_FIELD).
> > 
> > If we don't let the hardware do it, we need to transfer the state from
> > L2's VMCS while doing a nested VM exit for the exception (= behave as if
> > we entered the guest and exited).
> 
> Right, but by virtue of being in handle_invalid_guest_state() we're
> already effectively post-VMRESUME.  handle_invalid_guest_state() is
> technically called in the VMEXIT flow, but the initial check against
> emulation_required is done in vmx_vcpu_run().  Conceptually, emulating
> due to invalid state is distinctly different from emulating due to a
> specific VMEXIT.  In the latter case, we're emulating an instruction
> that has already been executed/attempted in the guest, whereas in the
> invalid guest case we're emulating instructions that haven't been seen
> before.

On a higher level of abstraction, KVM is emulating CPU just the same.
Here, we're starting by emulating L1's VMLAUNCH.  KVM cannot accelerate
the emulation of the VMLAUNCH instruction nor instructions that come
after VMLAUNCH (L2) due to missing hardware features.

The purpose of nested_run_pending is to finish the interrupt injection
of VMLAUNCH, which should be done before any L2 instruction is executed.
Normal flow uses (L0's) VMRESUME acceleration for that, but that is not
available in our case.
(Maybe it's possible to accelerate by manipulating VMCS into a valid
 state and forcing an immediate VM exit.)

>          So in that sense, clearing nested_run_pending when emulating
> invalid guest state is analogous to clearing nested_run_pending after
> VMENTER in vmx_vcpu_run().

Only if we actually do what VMLAUNCH does.  We should finish VMLAUNCH by
emulating the injection before emulating invalid L2 state.

> > And I think the actual fix here is to evaluate the interrupt before the
> > first emulate_instruction() in handle_invalid_guest_state().
> 
> The issue isn't that we don't evaluate an interrupt/exception, it's that
> nested_run_pending is never cleared and so inject_pending_event() always
> returns -EBUSY without injecting the VMEXIT(#UD) to L1.

nested_run_pending is never cleared because we accelerate injection of
the interrupt in hardware and therefore need a (L0) VMRESUME.

I was proposing to emulate the whole VMLAUNCH before emulating L2 guest
state, which would prevent the -EBUSY as a side effect.

> And in the failing case, there is no interrupt to evaluate before the
> first emulate_instruction(), the #UD is pended by emulate_instruction().
> Theoretically, any number of exceptions that L1 wants to intercept could
> be detected while emulating L2.

A simple fix for this particular case would be to skip setting
nested_run_pending, because we don't defer injection to (L0) VMRESUME.

> All that being said, I only encountered this bug due to the vcms02 sync
> bug (https://patchwork.kernel.org/patch/10259389/), e.g. L0 incorrectly
> thought L1 was attempting to run L2 with invalid guest state in vmcs12.
> Thinking about it more, L0 can only reach handle_invalid_guest_state()
> in the context of L2 if L0 has a bug, or if L1 attempts to run L2 with
> invalid state, e.g. L1 knowingly ignores the fact that unrestricted guest
> is disabled or has a bug of its own.

Right, it shouldn't be happening on recent hardware.

> So, I think the correct fix would be to return 1 from prepare_vmcs02
> if emulation_required is true, i.e. signal VMEntry failed due to
> EXIT_REASON_INVALID_STATE, and add a BUG either in vmx_vcpu_run() or
> handle_invalid_guest_state() that fires if we're emulating L2, i.e.
> BUG_ON(vmx->emulation_required && vmx->nested.nested_run_pending);

Sounds good (with a WARN_ONCE like Paolo said, because a guest can
configure it and KVM should support it),

thanks.