On Friday, 2018-03-09, Radim Krčmář wrote:
> 2018-03-05 09:39-0800, Sean Christopherson:
> > Clear nested_run_pending in handle_invalid_guest_state() after calling
> > emulate_instruction(), i.e. after attempting to emulate at least one
> > instruction.  This fixes an issue where L0 enters an infinite loop if
> > L2 hits an exception that is intercepted by L1 while L0 is emulating
> > L2's invalid guest state, effectively causing DoS on L1, e.g. the only
> > way to break the loop is to kill Qemu in L0.
> >
> >   1. call handle_invalid_guest_state() for L2
> >   2. emulate_instruction() pends an exception, e.g. #UD
> >   3. L1 intercepts the exception, i.e. nested_vmx_check_exception
> >      returns 1
> >   4. vmx_check_nested_events() returns -EBUSY because L1 wants to
> >      intercept the exception and nested_run_pending is true
> >   5. handle_invalid_guest_state() never makes forward progress for
> >      L2 due to the pending exception
> >   6. L1 retries VMLAUNCH and VMExits to L0 indefinitely, i.e. the
> >      L1 vCPU trying VMLAUNCH effectively hangs
> >
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> > ---
>
> nested_run_pending signals that we have to execute VMRESUME in order to
> do injection from L2's VMCS (at least VM_ENTRY_INTR_INFO_FIELD).
>
> If we don't let the hardware do it, we need to transfer the state from
> L2's VMCS while doing a nested VM exit for the exception (= behave as if
> we entered the guest and exited).

Right, but by virtue of being in handle_invalid_guest_state() we're
already effectively post-VMRESUME.  handle_invalid_guest_state() is
technically called in the VMEXIT flow, but the initial check against
emulation_required is done in vmx_vcpu_run().

Conceptually, emulating due to invalid guest state is distinctly
different from emulating due to a specific VMEXIT.  In the latter case
we're emulating an instruction that has already been executed/attempted
in the guest, whereas in the invalid guest state case we're emulating
instructions the guest hasn't seen before.  So in that sense, clearing
nested_run_pending when emulating invalid guest state is analogous to
clearing nested_run_pending after VMENTER in vmx_vcpu_run().

> And I think the actual fix here is to evaluate the interrupt before the
> first emulate_instruction() in handle_invalid_guest_state().

The issue isn't that we don't evaluate an interrupt/exception, it's
that nested_run_pending is never cleared, and so inject_pending_event()
always returns -EBUSY without injecting the VMEXIT(#UD) to L1 (the
check in question is sketched at the end of this mail).  And in the
failing case there is no interrupt to evaluate before the first
emulate_instruction(); the #UD is pended by emulate_instruction()
itself.  Theoretically, any number of exceptions that L1 wants to
intercept could be detected while emulating L2.

All that being said, I only encountered this bug due to the vmcs02 sync
bug (https://patchwork.kernel.org/patch/10259389/), i.e. L0 incorrectly
thought L1 was attempting to run L2 with invalid guest state in vmcs12.

Thinking about it more, L0 can only reach handle_invalid_guest_state()
in the context of L2 if L0 has a bug, or if L1 attempts to run L2 with
invalid guest state, e.g. L1 knowingly ignores the fact that
unrestricted guest is disabled, or L1 has a bug of its own.

So I think the correct fix would be to return 1 from prepare_vmcs02()
if emulation_required is true, i.e. signal that VMEntry failed due to
EXIT_REASON_INVALID_STATE, and to add a BUG in either vmx_vcpu_run() or
handle_invalid_guest_state() that fires if we're emulating L2, i.e.

	BUG_ON(vmx->emulation_required && vmx->nested.nested_run_pending);
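
Something along the lines of the below, though it's a completely
untested sketch: I'm assuming the existing *entry_failure_code plumbing
in prepare_vmcs02() can be reused, that ENTRY_FAIL_DEFAULT is an
acceptable exit qualification for this case, and that emulation_required
has already been recalculated by the time the check runs:

	/*
	 * Sketch only: at the tail of prepare_vmcs02(), after the vmcs12
	 * guest state has been loaded into vmcs02.  Fail the VMEntry
	 * instead of limping along with emulation_required set for L2.
	 */
	if (vmx->emulation_required) {
		*entry_failure_code = ENTRY_FAIL_DEFAULT;
		return 1;
	}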
> Do you want to look deeper into this?
>
> Thanks.
>
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 591214843046..3073160e6bae 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -6835,6 +6835,8 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
> >  
> >  		err = emulate_instruction(vcpu, 0);
> >  
> > +		vmx->nested.nested_run_pending = 0;
> > +
> >  		if (err == EMULATE_USER_EXIT) {
> >  			++vcpu->stat.mmio_exits;
> >  			ret = 0;
> > --
> > 2.16.2
> >
>
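
For reference, the vmx_check_nested_events() check that produces the
-EBUSY described above, paraphrased from memory and heavily simplified,
so the exact code and surrounding conditions may differ:

	/*
	 * While nested_run_pending is set, a pending exception that L1
	 * wants to intercept can never be converted into a nested VMEXIT,
	 * so handle_invalid_guest_state() never makes forward progress.
	 */
	if (vcpu->arch.exception.pending &&
	    nested_vmx_check_exception(vcpu, &exit_qual)) {
		if (vmx->nested.nested_run_pending)
			return -EBUSY;
		nested_vmx_inject_exception_vmexit(vcpu, exit_qual);
		return 0;
	}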