RE: [PATCH] KVM: nVMX: clear nested_run_pending when emulating invalid guest state

On Wed, 2018-03-14 at 08:22 -0700, Christopherson, Sean J wrote:
> On Tue, 2018-03-13 at 22:59 +0100, Radim Krčmář wrote:
> > 2018-03-13 13:22-0700, Sean Christopherson:
> > > On Tue, 2018-03-13 at 19:42 +0100, Radim Krčmář wrote:
> > > > 2018-03-09 23:06+0000, Christopherson, Sean J:
> > > > > On Friday, 2018-03-09, Radim Krčmář wrote:
> > > > > > 2018-03-05 09:39-0800, Sean Christopherson:
> > > > > > > Clear nested_run_pending in handle_invalid_guest_state() after calling
> > > > > > > emulate_instruction(), i.e. after attempting to emulate at least one
> > > > > > > instruction.  This fixes an issue where L0 enters an infinite loop if
> > > > > > > L2 hits an exception that is intercepted by L1 while L0 is emulating
> > > > > > > L2's invalid guest state, effectively causing DoS on L1, e.g. the only
> > > > > > > way to break the loop is to kill Qemu in L0.
> > > > > > > 
> > > > > > >     1. call handle_invalid_guest_state() for L2
> > > > > > >     2. emulate_instruction() pends an exception, e.g. #UD
> > > > > > >     3. L1 intercepts the exception, i.e. nested_vmx_check_exception
> > > > > > >        returns 1
> > > > > > >     4. vmx_check_nested_events() returns -EBUSY because L1 wants to
> > > > > > >        intercept the exception and nested_run_pending is true
> > > > > > >     5. handle_invalid_guest_state() never makes forward progress for
> > > > > > >        L2 due to the pending exception
> > > > > > >     6. L1 retries VMLAUNCH and VMExits to L0 indefinitely, i.e. the
> > > > > > >        L1 vCPU trying VMLAUNCH effectively hangs
> > > > > > > 
> > > > > > > Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> > > > > > > ---
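> > > > > > > 
> > > > > > > The change itself is a one-liner in the emulation loop of
> > > > > > > handle_invalid_guest_state(), roughly (exact diff context may
> > > > > > > differ):
> > > > > > > 
> > > > > > >  	while (vmx->emulation_required && count-- != 0) {
> > > > > > >  		...
> > > > > > >  		err = emulate_instruction(vcpu, 0);
> > > > > > > +
> > > > > > > +		/* We've attempted to emulate at least one L2
> > > > > > > +		 * instruction, i.e. L2 has "run", so stop blocking
> > > > > > > +		 * nested events on nested_run_pending.
> > > > > > > +		 */
> > > > > > > +		vmx->nested.nested_run_pending = 0;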
> > > > > > nested_run_pending signals that we have to execute VMRESUME in order to
> > > > > > do injection from L2's VMCS (at least VM_ENTRY_INTR_INFO_FIELD).
> > > > > > 
> > > > > > If we don't let the hardware do it, we need to transfer the state from
> > > > > > L2's VMCS while doing a nested VM exit for the exception (= behave as if
> > > > > > we entered the guest and exited).
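> > > > > > 
> > > > > > For reference, roughly the prepare_vmcs02() code that relies on
> > > > > > this (from memory, details may differ):
> > > > > > 
> > > > > > 	if (vmx->nested.nested_run_pending) {
> > > > > > 		/* hand vmcs12's pending event to hardware so that
> > > > > > 		 * VMRESUME performs the injection for us */
> > > > > > 		vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
> > > > > > 			     vmcs12->vm_entry_intr_info_field);
> > > > > > 		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
> > > > > > 			     vmcs12->vm_entry_exception_error_code);
> > > > > > 		vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
> > > > > > 			     vmcs12->vm_entry_instruction_len);
> > > > > > 	} else {
> > > > > > 		vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
> > > > > > 	}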
> > > > > Right, but by virtue of being in handle_invalid_guest_state() we're
> > > > > already effectively post-VMRESUME.  handle_invalid_guest_state() is
> > > > > technically called in the VMEXIT flow, but the initial check against
> > > > > emulation_required is done in vmx_vcpu_run().  Conceptually, emulating
> > > > > due to invalid state is distinctly different from emulating due to a
> > > > > specific VMEXIT.  In the latter case, we're emulating an instruction
> > > > > that has already been executed/attempted in the guest, whereas in the
> > > > > invalid guest case we're emulating instructions that haven't been seen
> > > > > before.
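> > > > > 
> > > > > I.e. the early bail in vmx_vcpu_run() (roughly):
> > > > > 
> > > > > 	/* Don't enter VMX if guest state is invalid, let the exit
> > > > > 	 * handler start emulation until we arrive back to a valid
> > > > > 	 * state.
> > > > > 	 */
> > > > > 	if (vmx->emulation_required)
> > > > > 		return;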
> > > > On a higher level of abstraction, KVM is emulating the CPU just the same.
> > > > Here, we're starting by emulating L1's VMLAUNCH.  KVM cannot accelerate
> > > > the emulation of the VMLAUNCH instruction nor instructions that come
> > > > after VMLAUNCH (L2) due to missing hardware features.
> > > > 
> > > > The purpose of nested_run_pending is to finish the interrupt injection
> > > > of VMLAUNCH, which should be done before any L2 instruction is executed.
> > > > Normal flow uses (L0's) VMRESUME acceleration for that, but that is not
> > > > available in our case.
> > > > (Maybe it's possible to accelerate this by manipulating the VMCS into
> > > >  a valid state and forcing an immediate VM exit.)
> > > There is nothing to inject at the time of VMLAUNCH.  L1's VMLAUNCH has
> > > completed
> > This seems to be the core of the disagreement.  The emulated VMLAUNCH
> > depends on VMX to inject whatever is in VM_ENTRY_INTR_INFO_FIELD and if
> > we never do a VMRESUME in L0, then the VMLAUNCH is incomplete.
> Ah, I see where you're coming from.  Digging a little deeper, I think
> the underlying issue is that handle_invalid_guest_state() doesn't
> properly handle exceptions that are pended by emulate_instruction().
> Regardless of whether we're emulating L1 or L2, we'll enter an infinite
> loop if emulation_required is true and an exception is pending.  As
> you said, we rely on VMRESUME to inject the exception, but we'll never
> execute VMRESUME if emulation_required is true (and it would fail due
> to invalid guest state if we tried to force the injection).  I verified
> this by generating a #GP while emulating Big RM for L1; KVM soft hangs
> and never returns to userspace.

Argh, my analysis was wrong for Big RM, though it's correct for invalid
state in protected mode.  KVM fully emulates interrupts for Big RM, i.e.
doesn't rely on VMRESUME for injection.  The soft hang I saw was because
pc-bios handled the #GP but didn't skip the faulting instruction.

Invalid PM, on the other hand, doesn't support emulating interrupts;
e.g. a #GP hit while emulating invalid guest state in PM will be
written into the VMCS for injection even though emulation_required is
true.  I'll take a stab at a patch to handle the most obvious case of
hitting an exception during handle_invalid_guest_state().
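
Roughly what I have in mind, in handle_invalid_guest_state() (completely
untested sketch):

	if (vcpu->arch.exception.pending && vmx->emulation_required) {
		/* Mirror handle_emulation_failure(): we can't VMRESUME
		 * to inject the exception, so punt to userspace instead
		 * of spinning forever.
		 */
		vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
		vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_EMULATION;
		vcpu->run->internal.ndata = 0;
		return 0;
	}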

> I think the fix would be to mirror what handle_emulation_failure()
> does if it gets a #UD at CPL0 and return KVM_INTERNAL_ERROR_EMULATION
> to userspace if an exception is pending and emulation_required is true.
> Note that a #UD in L1 will already do this because of the aforementioned
> handle_emulation_failure().
> 
> > 
> > Maybe I'm missing the point of nested_run_pending -- why do we have it
> > at all if not to signal that there is a pending injection from vmcs12?
> My understanding is that nested_run_pending is quite literal: there is
> a nested VM that needs to be run, do that next.  I don't think it's
> directly related to a pending injection, but rather is queried to know
> what is or isn't legal, e.g. in this case injecting a VMEXIT into L1
> isn't legal because we need to run L2 next.
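> 
> E.g. the -EBUSY that bites us in vmx_check_nested_events() (roughly):
> 
> 	if (vcpu->arch.exception.pending &&
> 	    nested_vmx_check_exception(vcpu, &exit_qual)) {
> 		/* can't inject a nested VMEXIT while the L2 entry is
> 		 * still pending */
> 		if (vmx->nested.nested_run_pending)
> 			return -EBUSY;
> 		nested_vmx_inject_exception_vmexit(vcpu, exit_qual);
> 		return 0;
> 	}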
> 
> > 
> > > 
> > >               we're running/emulating L2 with nothing pending when a
> > > #UD is encountered in L2.  In the unrestricted guest case, where we
> > > accelerated L1's VMLAUNCH, the vcpu_run postamble in the accelerated
> > > VMLAUNCH->VMEXIT path clears nested_run_pending after the VMEXIT(#UD).
> > > In the restricted guest case, we bailed out of vcpu_run very early
> > > without clearing nested_run_pending.
> > Yes, we do not need to inject in this case (which we could detect and
> > just never set nested_run_pending in the first place), but we need to in
> > the case where L1 sets a pending interrupt in the VMCS.
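> > 
> > (Hypothetically, that detection could be done at nested_vmx_run() time,
> > e.g. something like:
> > 
> > 	/* only require a hardware VM entry when vmcs12 actually has
> > 	 * a valid event to inject */
> > 	if (vmcs12->vm_entry_intr_info_field & INTR_INFO_VALID_MASK)
> > 		vmx->nested.nested_run_pending = 1;
> > 
> >  though unconditionally setting it is obviously the safer option.)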
> > 
> > > 
> > > Here's a comparison of how a #UD in L2 with invalid guest state would
> > > be handled with unrestricted guest enabled vs. disabled, and why I
> > > thought it made sense to clear nested_run_pending in
> > > handle_invalid_guest_state().
> > > 
> > > 
> > > Unrestricted Guest
> > > 
> > > H/W  Emu    Action              Result
> > > --------------------------------------
> > > L0   *      vcpu_run vmcs01     switch to L1
> > > L1   *      vcpu_run vmcs12     exit to L0
> > > L0   *      nested_vmx_run      vmx->nested.nested_run_pending = 1
> > > L0   *      vcpu_run vmcs02     switch to L2
> > The hardware injected an interrupt from vmcs12.
> > 
> > > 
> > > L2   *      L2 does stuff...    
> > > L2   *      #UD                 exit to L0
> > > L0   *      vcpu_run postamble  vmx->nested.nested_run_pending = 0
> > > L0   *      inject VMEXIT(#UD)  succeeds
> > > L0   *      vcpu_run vmcs01     switch to L1
> > > 
> > > 
> > > Restricted Guest
> > > 
> > > H/W  Emu    Action              Result
> > > --------------------------------------
> > > L0   *      vcpu_run vmcs01     switch to L1
> > > L1   *      vcpu_run vmcs12     exit to L0
> > > L0   *      nested_vmx_run      vmx->nested.nested_run_pending = 1
> > > L0   *      vcpu_run vmcs02     bails due to emulation_required
> > At <<Here>>, there is no code to inject an interrupt from vmcs12.
> > 
> > > 
> > > L0   L2     emulate_instruction emulate L2, i.e. post-VMLAUNCH vmcs12
> > > L0   L2     L2 does stuff...    
> > > L0   L2     #UD                 exit to L0 (emulated)
> > > 
> > > <this is where I was suggesting we clear nested_run_pending>
> > But why not earlier, e.g. before emulating instructions in the invalid
> > state?  I don't understand what nested_run_pending is supposed to do in
> > this case.  I think the right place to clear nested_run_pending is at
> > <<Here>>, where we would also do what the hardware does in the
> > unrestricted guest case.
> In an alternate universe where we weren't fixing the underlying bugs,
> I would have no qualms about clearing nested_run_pending earlier.  I put
> it after emulate_instruction because I was trying to be conservative,
> i.e. I wanted to put it in a location where we could say without a doubt
> that we "ran" L2.
> 
> > 
> > Thanks.
> > 
> > > 
> > > L0   *      inject VMEXIT(#UD)  fails, nested_run_pending is set
> > > L0   L2     infinite loop...



