2018-03-13 13:22-0700, Sean Christopherson:
> On Tue, 2018-03-13 at 19:42 +0100, Radim Krčmář wrote:
> > 2018-03-09 23:06+0000, Christopherson, Sean J:
> > > On Friday, 2018-03-09, Radim Krčmář wrote:
> > > > 2018-03-05 09:39-0800, Sean Christopherson:
> > > > > Clear nested_run_pending in handle_invalid_guest_state() after
> > > > > calling emulate_instruction(), i.e. after attempting to emulate
> > > > > at least one instruction.  This fixes an issue where L0 enters an
> > > > > infinite loop if L2 hits an exception that is intercepted by L1
> > > > > while L0 is emulating L2's invalid guest state, effectively
> > > > > causing a DoS on L1; the only way to break the loop is to kill
> > > > > QEMU in L0.
> > > > >
> > > > > 1. call handle_invalid_guest_state() for L2
> > > > > 2. emulate_instruction() pends an exception, e.g. #UD
> > > > > 3. L1 intercepts the exception, i.e. nested_vmx_check_exception
> > > > >    returns 1
> > > > > 4. vmx_check_nested_events() returns -EBUSY because L1 wants to
> > > > >    intercept the exception and nested_run_pending is true
> > > > > 5. handle_invalid_guest_state() never makes forward progress for
> > > > >    L2 due to the pending exception
> > > > > 6. L1 retries VMLAUNCH and VM-exits to L0 indefinitely, i.e. the
> > > > >    L1 vCPU trying VMLAUNCH effectively hangs
> > > > >
> > > > > Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> > > > > ---
> > > >
> > > > nested_run_pending signals that we have to execute VMRESUME in order
> > > > to do injection from L2's VMCS (at least VM_ENTRY_INTR_INFO_FIELD).
> > > >
> > > > If we don't let the hardware do it, we need to transfer the state
> > > > from L2's VMCS while doing a nested VM exit for the exception
> > > > (= behave as if we entered the guest and exited).
> > >
> > > Right, but by virtue of being in handle_invalid_guest_state() we're
> > > already effectively post-VMRESUME.  handle_invalid_guest_state() is
> > > technically called in the VMEXIT flow, but the initial check against
> > > emulation_required is done in vmx_vcpu_run().  Conceptually, emulating
> > > due to invalid guest state is distinctly different from emulating due
> > > to a specific VMEXIT.  In the latter case, we're emulating an
> > > instruction that has already been executed/attempted in the guest,
> > > whereas in the invalid guest state case we're emulating instructions
> > > that haven't been seen before.
> >
> > On a higher level of abstraction, KVM is emulating the CPU just the
> > same.  Here, we're starting by emulating L1's VMLAUNCH.  KVM cannot
> > accelerate the emulation of the VMLAUNCH instruction, nor of the
> > instructions that come after VMLAUNCH (L2), due to missing hardware
> > features.
> >
> > The purpose of nested_run_pending is to finish the interrupt injection
> > of VMLAUNCH, which should be done before any L2 instruction is
> > executed.  The normal flow uses (L0's) VMRESUME acceleration for that,
> > but that is not available in our case.
> > (Maybe it's possible to accelerate by manipulating the VMCS into a
> > valid state and forcing an immediate VM exit.)
>
> There is nothing to inject at the time of VMLAUNCH.  L1's VMLAUNCH has
> completed

This seems to be the core of the disagreement.  The emulated VMLAUNCH
depends on VMX to inject whatever is in VM_ENTRY_INTR_INFO_FIELD, and if
we never do a VMRESUME in L0, then the VMLAUNCH is incomplete.

Maybe I'm missing the point of nested_run_pending -- why do we have it
at all if not to signal that there is a pending injection from vmcs12?
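For reference, this is roughly what I understand the patch to do,
reconstructed from the changelog above -- a sketch against the 4.16-era
handle_invalid_guest_state() in arch/x86/kvm/vmx.c, not the actual diff:

static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	enum emulation_result err = EMULATE_DONE;

	while (vmx->emulation_required) {
		/* ... existing pre-emulation checks elided ... */

		err = emulate_instruction(vcpu, 0);

		/*
		 * At least one L2 instruction has now been emulated (or
		 * the attempt pended an exception), so treat the emulated
		 * VMLAUNCH as complete: clear nested_run_pending so that
		 * vmx_check_nested_events() stops returning -EBUSY when
		 * L1 intercepts the pending exception (step 4 above).
		 */
		vmx->nested.nested_run_pending = 0;

		if (err != EMULATE_DONE)
			break;

		/* ... signal/scheduling checks elided ... */
	}

	return 1;	/* simplified; the real function reports failures */
}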
> we're running/emulating L2 with nothing pending when a #UD is
> encountered in L2.  In the unrestricted guest case, where we
> accelerated L1's VMLAUNCH, the vcpu_run postamble in the accelerated
> VMLAUNCH->VMEXIT path clears nested_run_pending after the VMEXIT(#UD).
> In the restricted guest case, we bailed out of vcpu_run very early
> without clearing nested_run_pending.

Yes, we do not need to inject in this case (which we could detect and
just never set nested_run_pending in the first place), but we do need
to in the case where L1 sets a pending interrupt in the VMCS.

> Here's a comparison of how a #UD in L2 with invalid guest state would
> be handled with unrestricted guest enabled vs. disabled, and why I
> thought it made sense to clear nested_run_pending in
> handle_invalid_guest_state().
>
>
> Unrestricted Guest
>
>  H/W  Emu  Action                Result
>  ------------------------------------------------------------------
>  L0   *    vcpu_run vmcs01       switch to L1
>  L1   *    vcpu_run vmcs12       exit to L0
>  L0   *    nested_vmx_run        vmx->nested.nested_run_pending = 1
>  L0   *    vcpu_run vmcs02       switch to L2

The hardware injected an interrupt from vmcs12 at this point.

>  L2   *    L2 does stuff...
>  L2   *    #UD                   exit to L0
>  L0   *    vcpu_run postamble    vmx->nested.nested_run_pending = 0
>  L0   *    inject VMEXIT(#UD)    succeeds
>  L0   *    vcpu_run vmcs01       switch to L1
>
>
> Restricted Guest
>
>  H/W  Emu  Action                Result
>  ------------------------------------------------------------------
>  L0   *    vcpu_run vmcs01       switch to L1
>  L1   *    vcpu_run vmcs12       exit to L0
>  L0   *    nested_vmx_run        vmx->nested.nested_run_pending = 1
>  L0   *    vcpu_run vmcs02       bails due to emulation_required

At <<Here>>, there is no code to inject an interrupt from vmcs12.

>  L0   L2   emulate_instruction   emulate L2, i.e. post-VMLAUNCH vmcs12
>  L0   L2   L2 does stuff...
>  L0   L2   #UD                   exit to L0 (emulated)
>
> <this is where I was suggesting we clear nested_run_pending>

But why not earlier, e.g. before emulating instructions in the invalid
state?  I don't understand what nested_run_pending is supposed to do in
this case.

I think the right place to clear nested_run_pending is at <<Here>>,
where we would also do what hardware with unrestricted guest does.

Thanks.

>  L0   *    inject VMEXIT(#UD)    fails, nested_run_pending is set
>  L0   L2   infinite loop...
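Concretely, something along these lines at <<Here>> is what I have in
mind -- a rough sketch only, against the 4.16-era vmx_vcpu_run();
nested_inject_vmentry_event() is a made-up name for injecting the
vmcs12 VM_ENTRY_INTR_INFO_FIELD event by software:

static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);

	/* Don't enter VMX if guest state is invalid, let the exit
	   handler start emulation until we arrive back to a valid
	   state.  */
	if (vmx->emulation_required) {
		/*
		 * <<Here>>: the VMRESUME that would normally finish the
		 * event injection of the emulated VMLAUNCH will never
		 * happen, so do what the hardware would have done --
		 * inject the vmcs12 event by software -- and only then
		 * drop nested_run_pending, before any L2 instruction is
		 * emulated.
		 */
		if (vmx->nested.nested_run_pending) {
			nested_inject_vmentry_event(vcpu); /* hypothetical */
			vmx->nested.nested_run_pending = 0;
		}
		return;
	}

	/* ... the usual VMLAUNCH/VMRESUME fast path elided ... */
}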