On Tue, 2018-03-13 at 19:42 +0100, Radim Krcmár wrote:
> 2018-03-09 23:06+0000, Christopherson, Sean J:
> > On Friday, 2018-03-09, Radim Krčmář wrote:
> > > 2018-03-05 09:39-0800, Sean Christopherson:
> > > > Clear nested_run_pending in handle_invalid_guest_state() after
> > > > calling emulate_instruction(), i.e. after attempting to emulate at
> > > > least one instruction.  This fixes an issue where L0 enters an
> > > > infinite loop if L2 hits an exception that is intercepted by L1
> > > > while L0 is emulating L2's invalid guest state, effectively causing
> > > > a DoS on L1, e.g. the only way to break the loop is to kill Qemu
> > > > in L0.
> > > >
> > > >   1. call handle_invalid_guest_state() for L2
> > > >   2. emulate_instruction() pends an exception, e.g. #UD
> > > >   3. L1 intercepts the exception, i.e. nested_vmx_check_exception
> > > >      returns 1
> > > >   4. vmx_check_nested_events() returns -EBUSY because L1 wants to
> > > >      intercept the exception and nested_run_pending is true
> > > >   5. handle_invalid_guest_state() never makes forward progress for
> > > >      L2 due to the pending exception
> > > >   6. L1 retries VMLAUNCH and VMExits to L0 indefinitely, i.e. the
> > > >      L1 vCPU trying VMLAUNCH effectively hangs
> > > >
> > > > Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> > > > ---
> > >
> > > nested_run_pending signals that we have to execute VMRESUME in order
> > > to do injection from L2's VMCS (at least VM_ENTRY_INTR_INFO_FIELD).
> > >
> > > If we don't let the hardware do it, we need to transfer the state
> > > from L2's VMCS while doing a nested VM exit for the exception
> > > (= behave as if we entered the guest and exited).
> >
> > Right, but by virtue of being in handle_invalid_guest_state() we're
> > already effectively post-VMRESUME.  handle_invalid_guest_state() is
> > technically called in the VMEXIT flow, but the initial check against
> > emulation_required is done in vmx_vcpu_run().  Conceptually, emulating
> > due to invalid state is distinctly different from emulating due to a
> > specific VMEXIT.  In the latter case, we're emulating an instruction
> > that has already been executed/attempted in the guest, whereas in the
> > invalid guest state case we're emulating instructions that haven't
> > been seen before.
>
> On a higher level of abstraction, KVM is emulating a CPU just the same.
> Here, we're starting by emulating L1's VMLAUNCH.  KVM cannot accelerate
> the emulation of the VMLAUNCH instruction, nor of the instructions that
> come after VMLAUNCH (L2), due to missing hardware features.
>
> The purpose of nested_run_pending is to finish the interrupt injection
> of VMLAUNCH, which should be done before any L2 instruction is
> executed.  The normal flow uses (L0's) VMRESUME acceleration for that,
> but that is not available in our case.
> (Maybe it's possible to accelerate by manipulating the VMCS into a
> valid state and forcing an immediate VM exit.)

There is nothing to inject at the time of VMLAUNCH.  L1's VMLAUNCH has
completed, and we're running/emulating L2 with nothing pending when a
#UD is encountered in L2.  In the unrestricted guest case, where we
accelerated L1's VMLAUNCH, the vcpu_run postamble in the accelerated
VMLAUNCH->VMEXIT path clears nested_run_pending after the VMEXIT(#UD).
In the restricted guest case, we bailed out of vcpu_run very early,
without clearing nested_run_pending.
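
To illustrate the asymmetry, vmx_vcpu_run() looks roughly like this
(paraphrased, not the verbatim source):

	static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
	{
		struct vcpu_vmx *vmx = to_vmx(vcpu);

		/*
		 * Don't enter VMX if guest state is invalid, let the
		 * exit handler start emulating.  Note that we bail with
		 * nested_run_pending still set...
		 */
		if (vmx->emulation_required)
			return;

		/* ... inline asm that does VMLAUNCH/VMRESUME ... */

		/*
		 * ...whereas the postamble, which is reached only after
		 * an actual hardware VMENTER/VMEXIT, clears it.
		 */
		vmx->nested.nested_run_pending = 0;
	}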

Here's a comparison of how a #UD in L2 with invalid guest state would be
handled with unrestricted guest enabled vs. disabled, and why I thought
it made sense to clear nested_run_pending in
handle_invalid_guest_state().

Unrestricted Guest

H/W  Emu  Action               Result
--------------------------------------
L0   *    vcpu_run vmcs01      switch to L1
L1   *    vcpu_run vmcs12      exit to L0
L0   *    nested_vmx_run       vmx->nested.nested_run_pending = 1
L0   *    vcpu_run vmcs02      switch to L2
L2   *    L2 does stuff...
L2   *    #UD                  exit to L0
L0   *    vcpu_run postamble   vmx->nested.nested_run_pending = 0
L0   *    inject VMEXIT(#UD)   succeeds
L0   *    vcpu_run vmcs01      switch to L1

Restricted Guest

H/W  Emu  Action               Result
--------------------------------------
L0   *    vcpu_run vmcs01      switch to L1
L1   *    vcpu_run vmcs12      exit to L0
L0   *    nested_vmx_run       vmx->nested.nested_run_pending = 1
L0   *    vcpu_run vmcs02      bails due to emulation_required
L0   L2   emulate_instruction  emulate L2, i.e. post-VMLAUNCH vmcs12
L0   L2   L2 does stuff...
L0   L2   #UD                  exit to L0 (emulated)

          <this is where I was suggesting we clear nested_run_pending>

L0   *    inject VMEXIT(#UD)   fails, nested_run_pending is set
L0   L2   infinite loop...

> > So in that sense, clearing nested_run_pending when emulating invalid
> > guest state is analogous to clearing nested_run_pending after VMENTER
> > in vmx_vcpu_run().
>
> Only if we actually do what VMLAUNCH does.  We should finish VMLAUNCH
> by emulating the injection before emulating invalid L2 state.
>
> > > And I think the actual fix here is to evaluate the interrupt before
> > > the first emulate_instruction() in handle_invalid_guest_state().
> >
> > The issue isn't that we don't evaluate an interrupt/exception, it's
> > that nested_run_pending is never cleared, and so
> > inject_pending_event() always returns -EBUSY without injecting the
> > VMEXIT(#UD) to L1.
>
> nested_run_pending is never cleared because we accelerate injection of
> the interrupt in hardware and therefore need a (L0) VMRESUME.
>
> I was proposing to emulate the whole VMLAUNCH before emulating L2
> guest state, which would prevent the -EBUSY as a side effect.
>
> > And in the failing case, there is no interrupt to evaluate before the
> > first emulate_instruction(); the #UD is pended by
> > emulate_instruction().  Theoretically, any number of exceptions that
> > L1 wants to intercept could be detected while emulating L2.
>
> A simple fix for this particular case would be to skip setting
> nested_run_pending, because we don't defer injection to (L0) VMRESUME.
>
> > All that being said, I only encountered this bug due to the vmcs02
> > sync bug (https://patchwork.kernel.org/patch/10259389/), e.g. L0
> > incorrectly thought L1 was attempting to run L2 with invalid guest
> > state in vmcs12.  Thinking about it more, L0 can only reach
> > handle_invalid_guest_state() in the context of L2 if L0 has a bug,
> > or if L1 attempts to run L2 with invalid state, e.g. if L1 knowingly
> > ignores the fact that unrestricted guest is disabled, or has a bug
> > of its own.
>
> Right, it shouldn't be happening on recent hardware.
>
> > So, I think the correct fix would be to return 1 from prepare_vmcs02
> > if emulation_required is true, i.e. signal that VMEntry failed due to
> > EXIT_REASON_INVALID_STATE, and add a BUG either in vmx_vcpu_run() or
> > handle_invalid_guest_state() that fires if we're emulating L2, i.e.
> > BUG_ON(vmx->emulation_required && vmx->nested.nested_run_pending);
>
> Sounds good (with a WARN_ONCE like Paolo said, because a guest can
> configure it and KVM should support it),
>
> thanks.
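
Something like the following is what I have in mind, i.e. fail the
nested VMEntry instead of trying to emulate L2 (rough, untested sketch;
the WARN message and the exact plumbing of the failure back to the
VMEntry path are placeholders):

	/* In prepare_vmcs02(), after loading vmcs12's guest state: */
	if (vmx->emulation_required) {
		/*
		 * L1 is attempting VMEntry with invalid guest state (or
		 * we have a vmcs02 sync bug); signal VMEntry failure due
		 * to EXIT_REASON_INVALID_STATE instead of attempting to
		 * emulate L2.
		 */
		return 1;
	}

	/*
	 * And in handle_invalid_guest_state(), a WARN_ONCE rather than
	 * a BUG_ON, per Paolo's suggestion:
	 */
	WARN_ONCE(vmx->emulation_required && vmx->nested.nested_run_pending,
		  "KVM: emulating L2 with invalid guest state\n");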