Re: [PATCH v2 7/8] KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending

On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@xxxxxxxxxx> wrote:
> In case of an L2 VMExit to L0 during event delivery, VMCS02 is filled
> with IDT-vectoring info, which vmx_complete_interrupts() makes sure to
> reinject before the next resume of L2.
>
> While L0 is handling the VMExit, another L1 vCPU could send an IPI to
> the L1 vCPU which is currently running L2 and has exited to L0.
>
> When L0 reaches vcpu_enter_guest() and calls inject_pending_event(),
> it notes that a previous event was reinjected to L2 (via
> IDT-vectoring info) and therefore does not check whether there are
> pending L1 events which require an exit from L2 to L1. Thus, L0 enters
> L2 without an immediate VMExit even though there are pending L1 events!
>
> This commit fixes the issue by making sure to check for pending L1
> events even if a previous event was reinjected to L2, and by bailing
> out of inject_pending_event() before evaluating a new pending event
> if an event was already reinjected.
>
> The bug was observed by the following setup:
> * L0 is a 64-CPU machine which runs KVM.
> * L1 is a 16-CPU machine which runs KVM.
> * L0 & L1 run with APICv disabled.
> (Also reproduced with APICv enabled, but the information below is
> easier to analyze with APICv disabled.)
> * L1 runs a 16-CPU L2 Windows Server 2012 R2 guest.
> During L2 boot, L1 hangs completely. Analyzing the hang reveals that
> one L1 vCPU is holding KVM's mmu_lock and waiting forever on an IPI
> that it sent to another L1 vCPU, while all other L1 vCPUs are
> attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck forever
> (as L1 runs with kernel preemption disabled).
>
> Observing /sys/kernel/debug/tracing/trace_pipe reveals the following
> series of events:
> (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
> 0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182
> ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000

This looks like an EPT violation for writing a stack page that wasn't
present in the vmcs02 shadow EPT tables. I assume that L0 filled in
the missing shadow EPT table entry (or entries) and dismissed this EPT
violation itself rather than forwarding it to L1, because L1's EPT
tables have a valid entry for this L2 guest-physical address.
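
To make that disposition concrete, here is a minimal standalone model
(plain C, not KVM code; all names are illustrative) of the decision I
am describing: reflect the violation to L1 only when L1's own EPT
tables lack a translation, otherwise quietly fix the shadow entry.

#include <stdbool.h>
#include <stdio.h>

struct ept_state {
	bool l1_ept_valid;     /* L1's EPT maps this L2 guest-physical address */
	bool shadow_ept_valid; /* the vmcs02 shadow EPT maps it                */
};

static const char *dispose_ept_violation(struct ept_state *s)
{
	if (!s->l1_ept_valid)
		return "reflect VM-exit to L1";     /* L1 must handle it       */
	if (!s->shadow_ept_valid) {
		s->shadow_ept_valid = true;         /* L0 fills the shadow PTE */
		return "fix shadow EPT, resume L2"; /* invisible to L1 and L2  */
	}
	return "spurious fault, resume L2";
}

int main(void)
{
	struct ept_state s = { .l1_ept_valid = true, .shadow_ept_valid = false };
	printf("%s\n", dispose_ept_violation(&s)); /* fix shadow EPT, resume L2 */
	return 0;
}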

> (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f
> vec 252 (Fixed|edge)

This is the source of the bug, right here. An interrupt can only be
accepted at an instruction boundary, and the virtual CPU is in the
midst of IDT-vectoring, which is not at an instruction boundary. The
EPT violation from L2 to L0 is an L0 artifact that should be
completely invisible to L1/L2. Interrupt 252 should remain pending
until we are at the first instruction of the IDT handler for interrupt
210.

To be more precise, if interrupt 210 was delivered via VM-entry event
injection on L1 VM-entry to L2, then we need to:
a) complete the event injection (i.e. vector through the L2 IDT for vector 210),
b) handle virtual VMX-preemption timer expiration during VM-entry (if any),
c) handle NMI-window exiting events (if any),
d) handle L0 NMI (as an L2 to L1 VM-exit or by vectoring through the
L2 IDT, depending on the NMI-exiting control),
e) handle interrupt-window exiting events,
f) handle L0 maskable interrupts (as an L2 to L1 VM-exit or by
vectoring through the L2 IDT, depending on the external-interrupt
exiting control).
...

Only when we get to step (f) can we accept a new IRQ from L0's local APIC.
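
To make that ordering concrete, here is a standalone sketch (plain C;
illustrative names only, not KVM's actual implementation) that walks
steps (a) through (f) in priority order and shows why the newly
arrived vector 252 must wait until step (f):

#include <stdbool.h>
#include <stdio.h>

/* One flag per step (a)-(f) above; illustrative, not real KVM state. */
struct pending {
	bool reinjected_event; /* (a) vector 210 still mid-delivery          */
	bool preemption_timer; /* (b) virtual VMX-preemption timer expired   */
	bool nmi_window;       /* (c) NMI-window exiting                     */
	bool l0_nmi;           /* (d) L0 NMI                                 */
	bool irq_window;       /* (e) interrupt-window exiting               */
	bool l0_irq;           /* (f) L0 maskable interrupt, e.g. vector 252 */
};

static const char *next_event(const struct pending *p)
{
	if (p->reinjected_event) return "(a) finish injecting vector 210";
	if (p->preemption_timer) return "(b) preemption-timer VM-exit to L1";
	if (p->nmi_window)       return "(c) NMI-window VM-exit to L1";
	if (p->l0_nmi)           return "(d) deliver the NMI";
	if (p->irq_window)       return "(e) interrupt-window VM-exit to L1";
	if (p->l0_irq)           return "(f) accept IRQ from L0's local APIC";
	return "nothing pending, resume L2";
}

int main(void)
{
	/* Vector 252 arrives while vector 210 is still being injected: */
	struct pending p = { .reinjected_event = true, .l0_irq = true };
	puts(next_event(&p)); /* step (a): 252 must remain pending     */
	p.reinjected_event = false;
	puts(next_event(&p)); /* only now step (f): accept vector 252  */
	return 0;
}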

> (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210
> (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
> (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION
> rip 0xffffe00069202690 info 83 0
> (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
> 0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083
> ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000
> (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason:
> EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000
> ext_int: 0x00000000 ext_int_err: 0x00000000
> (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
>
> Which can be analyzed as follows:
> (1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2.
> Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and queue
> interrupt 0xd2 for reinjection.
> (2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination
> vCPU 15. This sets the relevant bit in the LAPIC's IRR and sets
> KVM_REQ_EVENT.
> (3) L0 reaches vcpu_enter_guest(), which calls inject_pending_event(),
> which notes that interrupt 0xd2 was reinjected and therefore calls
> vmx_inject_irq() and returns, without checking for pending L1 events!
> Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest()
> before calling inject_pending_event().
> (4) L0 resumes L2 without an immediate-exit even though there is a
> pending L1 event (the IPI pending in the LAPIC's IRR).
>
> We have already reached the buggy scenario, but the events can be
> analyzed further:
> (5+6) L2 VMExit to L0 on EPT_VIOLATION, this time not during
> event-delivery.
> (7) L0 decides to forward the VMExit to L1 for further handling.
> (8) L0 resumes L1. Note that because KVM_REQ_EVENT is cleared, the
> LAPIC's IRR is not examined and therefore the IPI is still not
> delivered to L1!
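
The buggy control flow in steps (3), (4), and (8) can be summarized in
a small standalone model (plain C; names are illustrative, not the
actual KVM symbols):

#include <stdbool.h>
#include <stdio.h>

struct vcpu_model {
	bool kvm_req_event;    /* request bit, cleared when consumed     */
	bool event_reinjected; /* IDT-vectoring info pending reinjection */
	bool l1_ipi_in_irr;    /* vector 0xfc waiting in the LAPIC's IRR */
};

/* Buggy flow: reinjection returns before pending L1 events are checked. */
static void enter_guest_buggy(struct vcpu_model *v)
{
	v->kvm_req_event = false;            /* step (3): request consumed */
	if (v->event_reinjected) {
		puts("reinject 0xd2 and resume L2");
		return;                      /* bug: the IRR is never read */
	}
	if (v->l1_ipi_in_irr)
		puts("request immediate exit from L2 to L1");
}

int main(void)
{
	struct vcpu_model v = {
		.kvm_req_event = true,
		.event_reinjected = true,
		.l1_ipi_in_irr = true,
	};
	enter_guest_buggy(&v); /* resumes L2; IPI 0xfc stays undelivered */
	/* Step (8): KVM_REQ_EVENT is now clear, so subsequent entries do
	 * not re-examine the IRR either, and L1 hangs. */
	return 0;
}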
>
> Signed-off-by: Liran Alon <liran.alon@xxxxxxxxxx>
> Reviewed-by: Nikita Leshenko <nikita.leshchenko@xxxxxxxxxx>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> ---
>  arch/x86/kvm/x86.c | 33 ++++++++++++++++++++-------------
>  1 file changed, 20 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a4b5a013064b..63359ab0e798 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6398,28 +6398,27 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>         int r;
>
>         /* try to reinject previous events if any */
> -       if (vcpu->arch.exception.injected) {
> -               kvm_x86_ops->queue_exception(vcpu);
> -               return 0;
> -       }
>
> +       if (vcpu->arch.exception.injected)
> +               kvm_x86_ops->queue_exception(vcpu);
>         /*
>          * NMI/interrupt must not be injected if an exception is
>          * pending, because the latter may have been queued by
>          * handling exit due to NMI/interrupt delivery.
>          */
> -       if (!vcpu->arch.exception.pending) {
> -               if (vcpu->arch.nmi_injected) {
> +       else if (!vcpu->arch.exception.pending) {
> +               if (vcpu->arch.nmi_injected)
>                         kvm_x86_ops->set_nmi(vcpu);
> -                       return 0;
> -               }
> -
> -               if (vcpu->arch.interrupt.injected) {
> +               else if (vcpu->arch.interrupt.injected)
>                         kvm_x86_ops->set_irq(vcpu);
> -                       return 0;
> -               }
>         }
>
> +       /*
> +        * Call check_nested_events() even if we reinjected a previous
> +        * event, so that the caller can determine whether it should
> +        * request an immediate exit from L2 to L1 because of pending
> +        * L1 events.
> +        */
>         if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) {
>                 r = kvm_x86_ops->check_nested_events(vcpu, req_int_win);
>                 if (r != 0)
> @@ -6438,6 +6437,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>                                         vcpu->arch.exception.has_error_code,
>                                         vcpu->arch.exception.error_code);
>
> +               WARN_ON_ONCE(vcpu->arch.exception.injected);
>                 vcpu->arch.exception.pending = false;
>                 vcpu->arch.exception.injected = true;
>
> @@ -6452,7 +6452,14 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>                 }
>
>                 kvm_x86_ops->queue_exception(vcpu);
> -       } else if (vcpu->arch.smi_pending && !is_smm(vcpu) && kvm_x86_ops->smi_allowed(vcpu)) {
> +       }
> +
> +       /* Don't consider new event if we re-injected an event */
> +       if (kvm_event_needs_reinjection(vcpu))
> +               return 0;
> +
> +       if (vcpu->arch.smi_pending && !is_smm(vcpu) &&
> +           kvm_x86_ops->smi_allowed(vcpu)) {
>                 vcpu->arch.smi_pending = false;
>                 enter_smm(vcpu);
>         } else if (vcpu->arch.nmi_pending && kvm_x86_ops->nmi_allowed(vcpu)) {
> --
> 1.9.1
>
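With the patch applied, the intended ordering is: reinject any previous
event, still give check_nested_events() a chance to request an
immediate exit, and only then bail out before evaluating new events.
Here is a standalone sketch of that ordering (plain C; illustrative
names, not the patched function itself):

#include <stdbool.h>
#include <stdio.h>

struct vcpu_model {
	bool event_reinjected; /* a previous event awaits reinjection */
	bool l1_event_pending; /* e.g. an IPI in the L1 LAPIC's IRR   */
	bool smi_pending;      /* an example of a "new" event         */
};

/* Fixed flow: check_nested_events() runs even while reinjecting. */
static int inject_pending_event_fixed(struct vcpu_model *v)
{
	if (v->event_reinjected)
		puts("reinject previous event into L2");

	/* Always evaluated, so a pending L1 event makes the caller
	 * request an immediate exit from L2 to L1. */
	if (v->l1_event_pending) {
		puts("check_nested_events(): immediate exit required");
		return -1; /* nonzero return: caller requests the exit */
	}

	if (v->event_reinjected)
		return 0;  /* don't consider new events after reinjection */

	if (v->smi_pending)
		puts("deliver a new SMI");
	return 0;
}

int main(void)
{
	struct vcpu_model v = {
		.event_reinjected = true,
		.l1_event_pending = true,
	};
	inject_pending_event_fixed(&v); /* reinjects, then requests exit */
	return 0;
}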


