Re: [PATCH 5/5] Nested VMX patch 5 implements vmlaunch and vmresume

Gleb Natapov <gleb@xxxxxxxxxx> wrote on 29/10/2009 19:31:05:

> On Wed, Oct 28, 2009 at 06:23:42PM +0200, Orit Wasserman wrote:
> >
> >
> > Gleb Natapov <gleb@xxxxxxxxxx> wrote on 25/10/2009 11:44:31:
> >
> > >
> > > On Thu, Oct 22, 2009 at 05:46:16PM +0200, Orit Wasserman wrote:
> > > >
> > > >
> > > > Gleb Natapov <gleb@xxxxxxxxxx> wrote on 22/10/2009 11:04:58:
> > > >
> > > > > On Wed, Oct 21, 2009 at 04:43:44PM +0200, Orit Wasserman wrote:
> > > > > > > > @@ -4641,10 +4955,13 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
> > > > > > > >     int type;
> > > > > > > >     bool idtv_info_valid;
> > > > > > > >
> > > > > > > > -   exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
> > > > > > > > -
> > > > > > > >     vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
> > > > > > > >
> > > > > > > > +   if (vmx->nested.nested_mode)
> > > > > > > > +      return;
> > > > > > > > +
> > > > > > > Why return here? What does the function do that should not be
> > > > > > > done in nested mode?
> > > > > > In nested mode L0 injects an interrupt into L2 only in one
> > > > > > scenario: if there is an IDT_VALID event and L0 decides to run
> > > > > > L2 again and not switch back to L1.
> > > > > > In all other cases the injection is handled by L1.
> > > > > This is exactly the kind of scenario that is handled by
> > > > > vmx_complete_interrupts(). (vmx|svm)_complete_interrupts() stores
> > > > > the pending event in an arch-agnostic way and re-injection is
> > > > > handled by x86.c. You bypass this logic by inserting a return here
> > > > > and introducing the nested_handle_valid_idt() function below.
> > > > The only location where we can truly know if we are switching to L1
> > > > is in vmx_vcpu_run, because enable_irq_window (which is called after
> > > > handling the exit) can decide to switch to L1 because of an
> > > > interrupt.
> > > enable_irq_window() will be called after the L2 VMCS has been set up
> > > for event re-injection by the previous call to inject_pending_event().
> > > As far as I can see this should work for interrupt injection. For
> > > exceptions we should probably require the L2 guest to re-execute the
> > > faulted instruction for now, like svm does.
> > The main issue is that L0 doesn't inject events into L2; the L1
> > hypervisor does (we want to keep the nested hypervisor semantics as much
> > as possible). Only if the event was caused by the fact that L2 is a
> > nested guest and L1 can't handle it will L0 re-inject an event into L2,
> > for example an IDT event with a page fault caused by a missing entry in
> > SPT02 (the shadow page table L0 creates for L2).
> > In this case, when vmx_complete_interrupts is called, L0 doesn't know
> > whether the page fault should be handled by it or by L1 (that is decided
> > later, when handling the exit).
> So what? When it is decided that an L2 exit is needed, the pending event
> will be transferred into L2's idt_vectoring_info. Otherwise the event will
> be reinjected by the usual mechanism. BTW I don't see where your current
> code sets up L2's idt_vectoring_info if it is decided that L1 should
> handle event re-injection.
Suppose we are executing an L2 guest and we get an exit. There are two
possible scenarios here:
A) The L2 exit will be handled by the L1 guest hypervisor. In this case,
when we switch to L1, the IDT vectoring info field is copied from vmcs(02)
to vmcs(12) in prepare_vmcs_12 (part of the nested_vmx_vmexit path). It is
then L1's responsibility to deal with the IDT vectoring info and do the
corresponding logic.
B) The L2 exit will be handled only by L0. In this case we never switch to
L1; L0 handles the exit and resumes L2. Any pending event in the vmcs(02)
IDT vectoring info field is injected into L2 when L0 resumes it.
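
To make case (A) concrete, here is a minimal sketch of the idt vectoring
hand-off in a prepare_vmcs_12-style helper. The struct and accessor names
(shadow_vmcs, get_shadow_vmcs) are placeholders, not the exact names from
the patch; only the VMCS field constants are the standard ones from vmx.h,
and the real function copies many more exit fields:

/* Sketch only -- illustrates the idt vectoring hand-off, nothing more. */
static void prepare_vmcs_12(struct kvm_vcpu *vcpu)
{
	struct shadow_vmcs *vmcs12 = get_shadow_vmcs(vcpu); /* placeholder */

	/*
	 * Case (A): we are about to switch to L1.  Propagate the event that
	 * was pending in vmcs(02) into vmcs(12) so that L1 sees it and
	 * decides whether and how to re-inject it into L2.
	 */
	vmcs12->idt_vectoring_info_field = vmcs_read32(IDT_VECTORING_INFO_FIELD);
	vmcs12->idt_vectoring_error_code = vmcs_read32(IDT_VECTORING_ERROR_CODE);
	vmcs12->vm_exit_instruction_len  = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
	vmcs12->vm_exit_intr_info        = vmcs_read32(VM_EXIT_INTR_INFO);
}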

KVM handles the IDT vectoring info at the end of vmx_vcpu_run, by calling
vmx_complete_interrupts. The decision whether or not to switch to L1 is
made at the following points:
1) nested_vmx_check_exception (called from vmx_queue_exception)
2) nested_vmx_intr (called from vmx_interrupt_allowed and
   enable_irq_window)
3) vmx_handle_exit

From the x86.c perspective the flow looks as follows:
vcpu_enter_guest {
 1
 2
 run (includes vmx_complete_interrupts)
 3
}
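
Roughly annotated (a sketch only; argument lists and surrounding logic are
simplified and approximate for the kvm tree of that time):

static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
{
	/* 1 + 2: injection decisions -- vmx_queue_exception may end up in
	 * nested_vmx_check_exception, and vmx_interrupt_allowed /
	 * enable_irq_window may end up in nested_vmx_intr */
	inject_pending_event(vcpu);

	/* run: vmx_vcpu_run, whose tail calls vmx_complete_interrupts */
	kvm_x86_ops->run(vcpu);

	/* 3: vmx_handle_exit decides whether an exit to L1 is needed */
	return kvm_x86_ops->handle_exit(vcpu);
}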


All these functions are called after vmx_vcpu_run has finished and
vmx_complete_interrupts has already executed. This prevents us from reusing
the regular non-nested IDT handling, because at that point we still don't
know whether the pending IDT event must be injected or not. That's the
reason we added the function nested_handle_valid_idt, which is called at
the beginning of vmx_vcpu_run. So now the flow from the x86.c perspective
looks like:
vcpu_enter_guest {
 1
 2
 nested_handle_valid_idt (injects a pending IDT event into L2 if needed;
                          only case B)
 run
 3
}
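
For reference, a minimal sketch of what such a re-injection step can look
like for case (B). This is an illustration, not the exact code of the
patch; it assumes vmcs(02) is the currently loaded VMCS and uses the
standard VMX field constants and the existing vcpu_vmx bookkeeping:

/* Sketch only: re-inject the event pending from the last L2 exit (case B). */
static int nested_handle_valid_idt(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	u32 idt_vectoring_info = vmx->idt_vectoring_info;

	if (!(idt_vectoring_info & VECTORING_INFO_VALID_MASK))
		return 0; /* nothing pending, nothing to do */

	/*
	 * L0 decided to resume L2 directly (no switch to L1), so the event
	 * recorded at the last exit must be delivered again on VM entry.
	 */
	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, idt_vectoring_info);
	if (idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK)
		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
			     vmcs_read32(IDT_VECTORING_ERROR_CODE));
	vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
		     vmcs_read32(VM_EXIT_INSTRUCTION_LEN));

	return 1;
}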

> > In most other cases, L0 will switch to L1 and L1 will decide if there
> > will be re-injection (depending on the L1 hypervisor logic) and update
> > the L2 VMCS accordingly.
> > >
> > > > In order to simplify our code it was easier to bypass
> > > > vmx_complete_interrupts when it is called (after running L2) and to
> > > > add nested_handle_valid_idt just before running L2.
> > > > > > >
> > > > > > > > +   exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
> > > > > > > > +
> > > > > > > >     /* Handle machine checks before interrupts are enabled */
> > > > > > > >     if ((vmx->exit_reason == EXIT_REASON_MCE_DURING_VMENTRY)
> > > > > > > >         || (vmx->exit_reason == EXIT_REASON_EXCEPTION_NMI
> > > > > > > > @@ -4747,6 +5064,60 @@ static void fixup_rmode_irq(struct vcpu_vmx *vmx)
> > > > > > > >        | vmx->rmode.irq.vector;
> > > > > > > >  }
> > > > > > > >
> > > > > > > > +static int nested_handle_valid_idt(struct kvm_vcpu *vcpu)
> > > > > > > > +{
> > > > > > > It seems by this function you are trying to bypass the general
> > > > > > > event reinjection logic. Why?
> > > > > > See above.
> > > > > The logic implemented by this function is handled in x86.c in an
> > > > > arch-agnostic way. Is there something wrong with this?
> > > > See my comment before
> > > Sometimes it is wrong to reinject events from L0 into L2 directly. If
> > > L2 was not able to handle an event because its IDT is not mapped by
> > > the L1 shadow page table, we should generate a PF vmexit with valid
> > > idt vectoring info to L1 and let L1 handle the event reinjection.
According to the above explanation, I think this is what we are doing in
the required case (A). Are we missing something?

Abel.


