> On 28 Aug 2018, at 1:21, Sean Christopherson <sean.j.christopherson@xxxxxxxxx> wrote:
>
> A VMX preemption timer value of '0' is guaranteed to cause a VMExit
> prior to the CPU executing any instructions in the guest. Use the
> preemption timer (if it's supported) to trigger immediate VMExit
> in place of the current method of sending a self-IPI. This ensures
> that pending VMExit injection to L1 occurs prior to executing any
> instructions in the guest (regardless of nesting level).
>
> When deferring VMExit injection, KVM generates an immediate VMExit
> from the (possibly nested) guest by sending itself an IPI. Because
> hardware interrupts are blocked prior to VMEnter and are unblocked
> (in hardware) after VMEnter, this results in taking a VMExit(INTR)
> before any guest instruction is executed. But, as this approach
> relies on the IPI being received before VMEnter executes, it only
> works as intended when KVM is running as L0. Because there are no
> architectural guarantees regarding when IPIs are delivered, when
> running nested the INTR may "arrive" long after L2 is running e.g.
> L0 KVM doesn't force an immediate switch to L1 to deliver an INTR.
>
> For the most part, this unintended delay is not an issue since the
> events being injected to L1 also do not have architectural guarantees
> regarding their timing. The notable exception is the VMX preemption
> timer[1], which is architecturally guaranteed to cause a VMExit prior
> to executing any instructions in the guest if the timer value is '0'
> at VMEnter. Specifically, the delay in injecting the VMExit causes
> the preemption timer KVM unit test to fail when run in a nested guest.
>
> Note: this approach is viable even on CPUs with a broken preemption
> timer, as broken in this context only means the timer counts at the
> wrong rate. There are no known errata affecting timer value of '0'.
>
> [1] I/O SMIs also have guarantees on when they arrive, but I have
> no idea if/how those are emulated in KVM.
>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
>

This commit made me wonder why exactly L0 delivers the IPI long after
L2 is already running.

The IPI is sent by L1 writing to the LAPIC ICR, which exits to L0 with
an APIC_WRITE exit and eventually reaches __apic_accept_irq(). Depending
on whether APICv is active, that will either set the vector in the L1
LAPIC IRR and raise KVM_REQ_EVENT, or set the vector in
vmx->pi_desc->pir and set ON in vmx->pi_desc->control.

On the next vcpu_enter_guest(), one of the following happens:
1. If APICv is disabled, KVM_REQ_EVENT is consumed but
   inject_pending_event() notices that vmx_interrupt_allowed() returns
   false. Thus, vcpu_enter_guest() calls enable_irq_window(), which
   sets VIRTUAL_INTR_PENDING in vmcs01.
2. If APICv is enabled, vcpu_enter_guest() calls vmx_sync_pir_to_irr(),
   which moves the vector from the PIR into the L1 LAPIC IRR and
   updates RVI accordingly. In this case, the CPU keeps the vector
   pending in RVI, as interrupts are disabled in L1.

Later, when L1 executes VMRESUME into L2, L0 reaches nested_vmx_run(),
which switches the active vmcs to vmcs02. Therefore:
a. If APICv is disabled, KVM_REQ_EVENT is no longer set, and since
   vmcs01 is not active, the CPU is unaware of the VIRTUAL_INTR_PENDING
   that was previously set. Thus, KVM will indeed enter L2 without
   exiting to L1 on the pending IPI. *THIS SEEMS LIKE A BUG TO ME*,
   regardless of this commit.
b. If APICv is enabled, vmx_sync_pir_to_irr() sees that ON is not set
   in vmx->pi_desc->control and that is_guest_mode() == true, and
   therefore vmx_hwapic_irr_update() does nothing as well. Again, we
   reach a scenario in which, even though L1 has a pending interrupt,
   it isn't evaluated (see the small toy model appended at the end of
   this mail). *THIS SEEMS LIKE A BUG TO ME*, regardless of this
   commit.

Therefore, even though this commit is correct from an architectural
perspective, it works around some real nVMX event-injection bugs that
should be solved.

What do you think? Did I miss something in my analysis?

-Liran
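
To make case (b) concrete, here is a small userspace toy model of the
flow as I read it. This is my own simplification for discussion, not
actual KVM code: the toy_* names and the 0xf3 vector are made up, and
only the KVM function names mentioned in the comments are real.

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy userspace model of case (b) above -- my own simplification for
 * discussion, not KVM code.  All struct/function names are made up.
 */
struct toy_vcpu {
	bool         pir_on;        /* models ON in vmx->pi_desc->control  */
	unsigned int pir_vector;    /* models a vector posted in the PIR   */
	unsigned int l1_irr_vector; /* models the vector in L1's LAPIC IRR */
	unsigned int rvi;           /* models RVI of the active vmcs       */
	bool         guest_mode;    /* models is_guest_mode(vcpu)          */
};

/* Models vmx_sync_pir_to_irr(): move PIR into IRR, maybe update RVI. */
static void toy_sync_pir_to_irr(struct toy_vcpu *v)
{
	if (v->pir_on) {
		v->pir_on = false;
		v->l1_irr_vector = v->pir_vector;
	}
	/* Models vmx_hwapic_irr_update(): skipped while running L2. */
	if (!v->guest_mode)
		v->rvi = v->l1_irr_vector;
}

int main(void)
{
	struct toy_vcpu v = { 0 };

	/* L1 writes the ICR to send itself the IPI: vector posted via PIR. */
	v.pir_on = true;
	v.pir_vector = 0xf3;

	/*
	 * Next vcpu_enter_guest(): the vector moves to L1's IRR and RVI,
	 * but L1 runs with interrupts disabled, so nothing is delivered.
	 */
	toy_sync_pir_to_irr(&v);

	/*
	 * L1 executes VMRESUME: nested_vmx_run() switches the active vmcs
	 * to vmcs02.  In this toy model, that means RVI starts clean and
	 * we are now in guest mode.
	 */
	v.guest_mode = true;
	v.rvi = 0;

	/*
	 * Next vcpu_enter_guest(): ON is already clear and guest_mode is
	 * true, so the pending L1 vector is never re-evaluated.
	 */
	toy_sync_pir_to_irr(&v);

	printf("L1 IRR vector: %#x, RVI seen by CPU: %#x\n",
	       v.l1_irr_vector, v.rvi);
	/*
	 * Prints a pending L1 vector (0xf3) with RVI 0: L2 keeps running
	 * without the exit to L1 that the self-IPI was meant to force.
	 */
	return 0;
}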