On Tue, 2018-08-28 at 02:38 +0300, Liran Alon wrote: > > > On 28 Aug 2018, at 1:21, Sean Christopherson <sean.j.christopherson@xxxxxxxxx> wrote: > > > > A VMX preemption timer value of '0' is guaranteed to cause a VMExit > > prior to the CPU executing any instructions in the guest. Use the > > preemption timer (if it's supported) to trigger immediate VMExit > > in place of the current method of sending a self-IPI. This ensures > > that pending VMExit injection to L1 occurs prior to executing any > > instructions in the guest (regardless of nesting level). > > > > When deferring VMExit injection, KVM generates an immediate VMExit > > from the (possibly nested) guest by sending itself an IPI. Because > > hardware interrupts are blocked prior to VMEnter and are unblocked > > (in hardware) after VMEnter, this results in taking a VMExit(INTR) > > before any guest instruction is executed. But, as this approach > > relies on the IPI being received before VMEnter executes, it only > > works as intended when KVM is running as L0. Because there are no > > architectural guarantees regarding when IPIs are delivered, when > > running nested the INTR may "arrive" long after L2 is running e.g. > > L0 KVM doesn't force an immediate switch to L1 to deliver an INTR. > > > > For the most part, this unintended delay is not an issue since the > > events being injected to L1 also do not have architectural guarantees > > regarding their timing. The notable exception is the VMX preemption > > timer[1], which is architecturally guaranteed to cause a VMExit prior > > to executing any instructions in the guest if the timer value is '0' > > at VMEnter. Specifically, the delay in injecting the VMExit causes > > the preemption timer KVM unit test to fail when run in a nested guest. > > > > Note: this approach is viable even on CPUs with a broken preemption > > timer, as broken in this context only means the timer counts at the > > wrong rate. There are no known errata affecting timer value of '0'. > > > > [1] I/O SMIs also have guarantees on when they arrive, but I have > > no idea if/how those are emulated in KVM. > > > > Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx> > > > This commit made me wonder why exactly does L0 delivers the IPI long after L2 is running. > > The IPI is sent by L1 writing to LAPIC ICR which will exit to L0 on APIC_WRITE > which will eventually reach __apic_accept_irq() that will either, based on if APICv is active, > set vector in L1 LAPIC IRR and set KVM_REQ_EVENT > or set vector in vmx->pi_desc->pir and set vmx->pi_desc->control with ON. > > On next vcpu_enter_guest() one of the following will happen: > 1. If APICv is disabled, KVM_REQ_EVENT will be consumed but inject_pending_event() will notice vmx_interrupt_allowed() returns false. > Thus, vcpu_enter_guest() will enable_irq_window() which will set VIRTUAL_INTR_PENDING in vmcs01. > 2. vcpu_enter_guest() will call vmx_sync_pir_to_irr() which will consume vector from PIR to L1 LAPIC IRR and update RVI accordingly. > In this case, CPU will still keep vector pending in RVI as interrupt is disabled on L1. > > Later, when L1 will VMRESUME into L2, L0 will reach nested_vmx_run() which will switch active vmcs to vmcs02. > Therefore: > a. If APICv is disabled, KVM_REQ_EVENT is currently not set and as vmcs01 is not active, > CPU is not aware of the VIRTUAL_INTR_PENDING that was previously set. Thus, KVM will indeed enter > into L2 without exiting to L1 on the pending IPI. *THIS SEEMS LIKE A BUG TO ME*. Regardless of this commit. > b. If APICv is enabled, vmx_sync_pir_to_irr() will see that vmx->pi_desc->control is not set with ON and > is_guest_mode()==true and therefore vmx_hwapic_irr_update() does nothing as-well. > Again, we reach a scenario that even though L1 has a pending interrupt, it isn’t evaluated. > *THIS SEEMS LIKE A BUG TO ME*. Regardless of this commit. > > Therefore, even though from architectural perspective, this commit is correct, it workaround > some real nVMX event-injection bugs that should be solved. > What do you think? Did I miss something in my analysis? Hmm, IMO it's the other way around. Relying on an IPI to trigger an immediate VMExit is a bug, and setting KVM_REQ_EVENT is a workaround. There are simply no guarantees regarding the delivery time of IPIs. For example, setting KVM_REQ_EVENT at the end of nested_vmx_run() allows the preemption timer unit test to pass in L1, but it continues to fail in L2 (even if L0, L1 and L2 are all patched). At the time of VMEnter to L4 (from L0), L2 has done smp_send_reschedule() to self-IPI so that it can inject a VMExit to L3. But, L1 doesn't inject an event to L2 because its nested_run_pending is set and so there is no pending event visible to L0. Setting KVM_REQ_EVENT in L1 doesn't resolve the issue because L1 can't process that request until it gets control back, which doesn't happen until L4 takes a VMExit. On the other handing, using the preemption timer to trigger VMExit ensures L0 will see the pending event (for lack of a better term) at the time of VMEnter to L4 because the event is visible in both vmcs24 and vmcs14. Using this approach, the unit test passes in L1 and L2 if both are patched, even if L0 is unpatched (since the current self-IPI mechanism works when running on bare metal). So, setting KVM_REQ_EVENT won't cause problems, but it also doesn't truly fix the issue. I can tack on a patch to set KVM_REQ_EVENT in vmx_nested_run() if the VMX preemption timer isn't supported. That would partially workaround the issue on older hardware.