Re: [PATCH v2 7/8] KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending

On 28/11/17 08:39, Jim Mattson wrote:
On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@xxxxxxxxxx> wrote:
In case of an L2 VMExit to L0 during event delivery, VMCS02 is filled with
IDT-vectoring info which vmx_complete_interrupts() makes sure to
reinject before the next resume of L2.

While handling the VMExit in L0, an IPI could be sent by another L1 vCPU
to the L1 vCPU which is currently running L2 and has exited to L0.

When L0 reaches vcpu_enter_guest() and calls inject_pending_event(),
it will note that a previous event was re-injected to L2 (by
IDT-vectoring info) and therefore won't check whether there are pending L1
events which require an exit from L2 to L1. Thus, L0 enters L2 without an
immediate VMExit even though there are pending L1 events!

This commit fixes the issue by making sure to check for pending L1
events even if a previous event was reinjected to L2, and by bailing out
of inject_pending_event() before evaluating a new pending event in
case an event was already reinjected.

The bug was observed with the following setup:
* L0 is a 64-CPU machine which runs KVM.
* L1 is a 16-CPU machine which runs KVM.
* L0 & L1 run with APICv disabled.
(Also reproduced with APICv enabled, but the info below is easier to
analyze with APICv disabled.)
* L1 runs a 16-CPU L2 Windows Server 2012 R2 guest.
During L2 boot, L1 hangs completely, and analyzing the hang reveals that
one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI
that it has sent to another L1 vCPU, while all other L1 vCPUs are
attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck forever
(as L1 runs with kernel preemption disabled).

Observing /sys/kernel/debug/tracing/trace_pipe reveals the following
series of events:
(1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182
ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000

This looks like an EPT violation for writing a stack page that wasn't
present in the vmcs02 shadow EPT tables. I assume that L0 filled in
the missing shadow EPT table entry (or entries) and dismissed this EPT
violation itself rather than forwarding it to L1, because L1's EPT
tables have a valid entry for this L2 guest-physical address.
Yes. This is how I see it as well. One can see in (4) that L0 resumed L2 without exiting to L1.


(2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f
vec 252 (Fixed|edge)

This is the source of the bug, right here. An interrupt can only be
accepted at an instruction boundary, and the virtual CPU is in the
midst of IDT-vectoring, which is not at an instruction boundary. The
EPT violation from L2 to L0 is an L0 artifact that should be
completely invisible to L1/L2. Interrupt 252 should remain pending
until we are at the first instruction of the IDT handler for interrupt
210.
The kvm_apic_accept_irq trace is printed from __apic_accept_irq().
This function is called by the IPI *sender* CPU after it has found the vCPU LAPIC to which the IPI should be delivered. See how the function is called in the following code path: kvm_lapic_reg_write() APIC_ICR handler -> apic_send_ipi() -> kvm_irq_delivery_to_apic() -> kvm_apic_set_irq() -> __apic_accept_irq().

In addition, following __apic_accept_irq()'s APIC_DM_FIXED handler, one can see how the IPI is actually delivered to the destination CPU. In this case APICv is disabled, and therefore the IPI is basically delivered by setting the relevant bit in the destination LAPIC's IRR, setting KVM_REQ_EVENT, and kicking the destination vCPU in case it is blocked / inside the guest. This will cause the destination CPU to eventually reach vcpu_enter_guest(), which will consume KVM_REQ_EVENT and call inject_pending_event(). Only then will it notice that kvm_cpu_has_injectable_intr() returns true (as the bit is set in the LAPIC IRR) and call kvm_x86_ops->set_irq() to actually inject the IRQ into the guest by writing it to the VMCS.
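
To make this concrete, the sender-side delivery boils down to roughly the following. This is only a simplified sketch in the spirit of __apic_accept_irq()'s APIC_DM_FIXED case when APICv is disabled; the helper name deliver_fixed_ipi_sketch() is made up for illustration and is not the exact upstream code:

        /*
         * Simplified sketch (hypothetical helper, for illustration only):
         * sender-side delivery of a fixed-mode IPI with APICv disabled.
         */
        static void deliver_fixed_ipi_sketch(struct kvm_lapic *apic, int vector)
        {
                struct kvm_vcpu *vcpu = apic->vcpu;

                /* Mark the vector pending in the destination vCPU's LAPIC IRR. */
                kvm_lapic_set_irr(vector, apic);

                /* Ask the destination vCPU to re-evaluate pending events ... */
                kvm_make_request(KVM_REQ_EVENT, vcpu);

                /* ... and kick it out of guest mode (or wake it) so it does. */
                kvm_vcpu_kick(vcpu);
        }

Note that nothing here injects the interrupt into the destination; it only becomes pending until the destination vCPU's vcpu_enter_guest() processes KVM_REQ_EVENT.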

Therefore, this is not the source of the bug in my opinion. The interrupt isn't really accepted at an instruction boundary...

To be more precise, if interrupt 210 was delivered via VM-entry event
injection on L1 VM-entry to L2, then we need to:
a) complete the event injection (i.e. vector through the L2 IDT for vector 210),
b) handle virtual VMX-preemption timer expiration during VM-entry (if any),
c) handle NMI-window exiting events (if any),
d) handle L0 NMI (as an L2 to L1 VM-exit or by vectoring through the
L2 IDT, depending on the NMI-exiting control),
e) handle interrupt-window exiting events,
f) handle L0 maskable interrupts (as an L2 to L1 VM-exit or by
vectoring through the L2 IDT, depending on the external-iinterrupt
exiting control).
...

Only when we get to step (f) can we accept a new IRQ from L0's local APIC.

This is exactly what inject_pending_event() should be doing after this commit.

It starts by checking if there is some event to "re-inject", which is basically step (a). Then it calls check_nested_events(), which basically handles steps (b)+(c)+(d)+(e). Then it evaluates whether there is a pending exception to inject (it could have been queued while handling an exit due to event delivery) and, if so, injects it into the guest. Only then do we check whether we have already injected some event into the guest. If yes, resume the guest with the injected event already written in the VMCS. Otherwise, continue evaluating further pending events such as pending L1 SMIs, NMIs or LAPIC interrupts and inject them if possible.

The problem this patch attempts to fix is that, before this patch, after performing step (a), if we indeed "re-injected" some event, we would not perform any evaluation of whether there is some pending event which should require us to exit from L2 to L1. In other words, completing the "re-injection" of the event whose delivery caused the EPT_VIOLATION blocked us from evaluating the pending L1 IPI which was sent by another CPU, and therefore we missed an EXTERNAL_INTERRUPT exit from L2 to L1.
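
For reference, this is roughly what the pre-patch reinjection logic looked like, reconstructed from the lines the diff below removes (details elided):

        /* try to reinject previous events if any */
        if (vcpu->arch.exception.injected) {
                kvm_x86_ops->queue_exception(vcpu);
                return 0;               /* bails out here ... */
        }

        if (!vcpu->arch.exception.pending) {
                if (vcpu->arch.nmi_injected) {
                        kvm_x86_ops->set_nmi(vcpu);
                        return 0;       /* ... or here ... */
                }

                if (vcpu->arch.interrupt.injected) {
                        kvm_x86_ops->set_irq(vcpu);
                        return 0;       /* ... or here, */
                }
        }

        /*
         * ... so on a reinjection path check_nested_events() was never
         * called, and a pending L1 event (e.g. the IPI sitting in the
         * LAPIC IRR) could not force an immediate exit from L2 to L1.
         */

With the patch, reinjection no longer returns early: check_nested_events() is always consulted, and only afterwards does kvm_event_needs_reinjection() stop the evaluation of brand-new events.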

Regards,
-Liran

(3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210
(4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
(5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION
rip 0xffffe00069202690 info 83 0
(6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083
ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000
(7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason:
EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000
ext_int: 0x00000000 ext_int_err: 0x00000000
(8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15

Which can be analyzed as follows:
(1) L2 VMExits to L0 on EPT_VIOLATION during delivery of vector 0xd2.
Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject
the pending interrupt 0xd2.
(2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination
vCPU 15. This sets the relevant bit in the LAPIC's IRR and sets KVM_REQ_EVENT.
(3) L0 reaches vcpu_enter_guest(), which calls inject_pending_event(), which
notes that interrupt 0xd2 was reinjected and therefore calls
vmx_inject_irq() and returns, without checking for pending L1 events!
Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest()
before calling inject_pending_event().
(4) L0 resumes L2 without an immediate exit even though there is a pending
L1 event (the IPI pending in the LAPIC's IRR).

We have already reached the buggy scenario, but the events can be
analyzed further:
(5+6) L2 VMExits to L0 on EPT_VIOLATION, this time not during
event delivery.
(7) L0 decides to forward the VMExit to L1 for further handling.
(8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the
LAPIC's IRR is not examined and therefore the IPI is still not delivered
into L1!

Signed-off-by: Liran Alon <liran.alon@xxxxxxxxxx>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@xxxxxxxxxx>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
---
  arch/x86/kvm/x86.c | 33 ++++++++++++++++++++-------------
  1 file changed, 20 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a4b5a013064b..63359ab0e798 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6398,28 +6398,27 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
         int r;

         /* try to reinject previous events if any */
-       if (vcpu->arch.exception.injected) {
-               kvm_x86_ops->queue_exception(vcpu);
-               return 0;
-       }

+       if (vcpu->arch.exception.injected)
+               kvm_x86_ops->queue_exception(vcpu);
         /*
          * NMI/interrupt must not be injected if an exception is
          * pending, because the latter may have been queued by
          * handling exit due to NMI/interrupt delivery.
          */
-       if (!vcpu->arch.exception.pending) {
-               if (vcpu->arch.nmi_injected) {
+       else if (!vcpu->arch.exception.pending) {
+               if (vcpu->arch.nmi_injected)
                         kvm_x86_ops->set_nmi(vcpu);
-                       return 0;
-               }
-
-               if (vcpu->arch.interrupt.injected) {
+               else if (vcpu->arch.interrupt.injected)
                         kvm_x86_ops->set_irq(vcpu);
-                       return 0;
-               }
         }

+       /*
+        * Call check_nested_events() even if we reinjected a previous event
+        * in order for caller to determine if it should require immediate-exit
+        * from L2 to L1 due to pending L1 events which require exit
+        * from L2 to L1.
+        */
         if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) {
                 r = kvm_x86_ops->check_nested_events(vcpu, req_int_win);
                 if (r != 0)
@@ -6438,6 +6437,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
                                         vcpu->arch.exception.has_error_code,
                                         vcpu->arch.exception.error_code);

+               WARN_ON_ONCE(vcpu->arch.exception.injected);
                 vcpu->arch.exception.pending = false;
                 vcpu->arch.exception.injected = true;

@@ -6452,7 +6452,14 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
                 }

                 kvm_x86_ops->queue_exception(vcpu);
-       } else if (vcpu->arch.smi_pending && !is_smm(vcpu) && kvm_x86_ops->smi_allowed(vcpu)) {
+       }
+
+       /* Don't consider new event if we re-injected an event */
+       if (kvm_event_needs_reinjection(vcpu))
+               return 0;
+
+       if (vcpu->arch.smi_pending && !is_smm(vcpu) &&
+           kvm_x86_ops->smi_allowed(vcpu)) {
                 vcpu->arch.smi_pending = false;
                 enter_smm(vcpu);
         } else if (vcpu->arch.nmi_pending && kvm_x86_ops->nmi_allowed(vcpu)) {
--
1.9.1



