On Tue, 2018-03-13 at 22:59 +0100, Radim Krcmár wrote:
> 2018-03-13 13:22-0700, Sean Christopherson:
> > On Tue, 2018-03-13 at 19:42 +0100, Radim Krcmár wrote:
> > > 2018-03-09 23:06+0000, Christopherson, Sean J:
> > > > On Friday, 2018-03-09, Radim Krčmář wrote:
> > > > > 2018-03-05 09:39-0800, Sean Christopherson:
> > > > > > Clear nested_run_pending in handle_invalid_guest_state() after calling
> > > > > > emulate_instruction(), i.e. after attempting to emulate at least one
> > > > > > instruction.  This fixes an issue where L0 enters an infinite loop if
> > > > > > L2 hits an exception that is intercepted by L1 while L0 is emulating
> > > > > > L2's invalid guest state, effectively causing a DoS on L1, e.g. the
> > > > > > only way to break the loop is to kill Qemu in L0.
> > > > > >
> > > > > > 1. call handle_invalid_guest_state() for L2
> > > > > > 2. emulate_instruction() pends an exception, e.g. #UD
> > > > > > 3. L1 intercepts the exception, i.e. nested_vmx_check_exception
> > > > > >    returns 1
> > > > > > 4. vmx_check_nested_events() returns -EBUSY because L1 wants to
> > > > > >    intercept the exception and nested_run_pending is true
> > > > > > 5. handle_invalid_guest_state() never makes forward progress for
> > > > > >    L2 due to the pending exception
> > > > > > 6. L1 retries VMLAUNCH and VMExits to L0 indefinitely, i.e. the
> > > > > >    L1 vCPU trying VMLAUNCH effectively hangs
> > > > > >
> > > > > > Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> > > > > > ---
> > > > > nested_run_pending signals that we have to execute VMRESUME in order to
> > > > > do injection from L2's VMCS (at least VM_ENTRY_INTR_INFO_FIELD).
> > > > >
> > > > > If we don't let the hardware do it, we need to transfer the state from
> > > > > L2's VMCS while doing a nested VM exit for the exception (= behave as if
> > > > > we entered the guest and exited).
> > > > Right, but by virtue of being in handle_invalid_guest_state() we're
> > > > already effectively post-VMRESUME.  handle_invalid_guest_state() is
> > > > technically called in the VMEXIT flow, but the initial check against
> > > > emulation_required is done in vmx_vcpu_run().  Conceptually, emulating
> > > > due to invalid state is distinctly different from emulating due to a
> > > > specific VMEXIT.  In the latter case, we're emulating an instruction
> > > > that has already been executed/attempted in the guest, whereas in the
> > > > invalid guest case we're emulating instructions that haven't been seen
> > > > before.
> > > On a higher level of abstraction, KVM is emulating the CPU just the same.
> > > Here, we're starting by emulating L1's VMLAUNCH.  KVM cannot accelerate
> > > the emulation of the VMLAUNCH instruction nor instructions that come
> > > after VMLAUNCH (L2) due to missing hardware features.
> > >
> > > The purpose of nested_run_pending is to finish the interrupt injection
> > > of VMLAUNCH, which should be done before any L2 instruction is executed.
> > > The normal flow uses (L0's) VMRESUME acceleration for that, but that is
> > > not available in our case.
> > > (Maybe it's possible to accelerate by manipulating the VMCS into a valid
> > > state and forcing an immediate VM exit.)
> > There is nothing to inject at the time of VMLAUNCH.  L1's VMLAUNCH has
> > completed
> This seems to be the core of disagreement.
> The emulated VMLAUNCH depends on VMX to inject whatever is in
> VM_ENTRY_INTR_INFO_FIELD, and if we never do a VMRESUME in L0, then the
> VMLAUNCH is incomplete.

Ah, I see where you're coming from.  Digging a little deeper, I think
the underlying issue is that handle_invalid_guest_state() doesn't
properly handle exceptions that are pended by emulate_instruction().
Regardless of whether we're emulating L1 or L2, we'll enter an infinite
loop if emulation_required is true and an exception is pending.  As you
said, we rely on VMRESUME to inject the exception, but we'll never
execute VMRESUME if emulation_required is true (and it would fail due
to invalid guest state if we tried to force the injection).  I verified
this by generating a #GP while emulating Big RM for L1; KVM soft hangs
and never returns to userspace.

I think the fix would be to mirror what handle_emulation_failure() does
when it gets a #UD at CPL0, i.e. return KVM_INTERNAL_ERROR_EMULATION to
userspace if an exception is pending and emulation_required is true.
Note that a #UD in L1 will already do this because of the
aforementioned handle_emulation_failure().
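Roughly the following, inside the emulation loop in
handle_invalid_guest_state() (a completely untested sketch to
illustrate the idea; the exact condition likely needs refinement,
e.g. to avoid triggering in cases where the pending exception can
actually be injected):

	err = emulate_instruction(vcpu, 0);

	/* ... existing EMULATE_USER_EXIT and failure handling ... */

	/*
	 * Exit to userspace if emulation is still required and an
	 * exception is pending; we can neither emulate past the
	 * exception nor execute VMRESUME to inject it, so staying
	 * in the emulation loop would hang the vCPU forever.
	 */
	if (vmx->emulation_required && vcpu->arch.exception.pending) {
		vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
		vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_EMULATION;
		vcpu->run->internal.ndata = 0;
		return 0;
	}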
> Maybe I'm missing the point of nested_run_pending -- why do we have it
> at all if not to signal that there is a pending injection from vmcs12?

My understanding is that nested_run_pending is quite literal: there is
a nested VM that needs to be run, do that next.  I don't think it's
directly related to a pending injection, but rather is queried to know
what is or isn't legal, e.g. in this case injecting a VMEXIT into L1
isn't legal because we need to run L2 next.

> > we're running/emulating L2 with nothing pending when a
> > #UD is encountered in L2.  In the unrestricted guest case, where we
> > accelerated L1's VMLAUNCH, the vcpu_run postamble in the accelerated
> > VMLAUNCH->VMEXIT path clears nested_run_pending after the VMEXIT(#UD).
> > In the restricted guest case, we bailed out of vcpu_run very early
> > without clearing nested_run_pending.
> Yes, we do not need to inject in this case (which we could detect and
> just never set nested_run_pending in the first place), but we need to
> in the case where L1 sets a pending interrupt in the VMCS.
>
> > Here's a comparison of how a #UD in L2 with invalid guest state would
> > be handled with unrestricted guest enabled vs. disabled, and why I
> > thought it made sense to clear nested_run_pending in
> > handle_invalid_guest_state().
> >
> > Unrestricted Guest
> >
> > H/W   Emu   Action                 Result
> > --------------------------------------
> > L0    *     vcpu_run vmcs01        switch to L1
> > L1    *     vcpu_run vmcs12        exit to L0
> > L0    *     nested_vmx_run         vmx->nested.nested_run_pending = 1
> > L0    *     vcpu_run vmcs02        switch to L2
> The hardware injected an interrupt from vmcs12.
>
> > L2    *     L2 does stuff...
> > L2    *     #UD                    exit to L0
> > L0    *     vcpu_run postamble     vmx->nested.nested_run_pending = 0
> > L0    *     inject VMEXIT(#UD)     succeeds
> > L0    *     vcpu_run vmcs01        switch to L1
> >
> > Restricted Guest
> >
> > H/W   Emu   Action                 Result
> > --------------------------------------
> > L0    *     vcpu_run vmcs01        switch to L1
> > L1    *     vcpu_run vmcs12        exit to L0
> > L0    *     nested_vmx_run         vmx->nested.nested_run_pending = 1
> > L0    *     vcpu_run vmcs02        bails due to emulation_required
> <<Here>> is no code to inject an interrupt from vmcs12.
>
> > L0    L2    emulate_instruction    emulate L2, i.e. post-VMLAUNCH vmcs12
> > L0    L2    L2 does stuff...
> > L0    L2    #UD                    exit to L0 (emulated)
> >
> > <this is where I was suggesting we clear nested_run_pending>
> But why not earlier, e.g. before emulating instructions in the invalid
> state?  I don't understand what nested_run_pending is supposed to do in
> this case.  I think the right place to clear nested_run_pending is at
> <<Here>>, where we also do what hardware with unrestricted guest does.

In an alternate universe where we weren't fixing the underlying bugs,
I would have no qualms about clearing nested_run_pending earlier.  I
put it after emulate_instruction() because I was trying to be
conservative, i.e. I wanted to put it in a location where we could say
without a doubt that we "ran" L2.

> Thanks.
>
> > L0    *     inject VMEXIT(#UD)     fails, nested_run_pending is set
> > L0    L2    infinite loop...
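For reference, "after calling emulate_instruction()" from the patch
amounts to roughly the following in handle_invalid_guest_state()'s
emulation loop (an illustrative sketch of the placement, not the
literal diff):

	err = emulate_instruction(vcpu, 0);

	/*
	 * At this point at least one instruction has been emulated
	 * (or an attempt was made), i.e. we have unquestionably
	 * "run" L2, so injecting a VMEXIT into L1 for a pending
	 * exception is now legal and vmx_check_nested_events() will
	 * no longer return -EBUSY.
	 */
	vmx->nested.nested_run_pending = 0;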