> On 28 Aug 2018, at 17:46, Sean Christopherson <sean.j.christopherson@xxxxxxxxx> wrote:
>
> On Tue, 2018-08-28 at 02:38 +0300, Liran Alon wrote:
>>
>>> On 28 Aug 2018, at 1:21, Sean Christopherson <sean.j.christopherson@xxxxxxxxx> wrote:
>>>
>>> A VMX preemption timer value of '0' is guaranteed to cause a VMExit
>>> prior to the CPU executing any instructions in the guest. Use the
>>> preemption timer (if it's supported) to trigger an immediate VMExit
>>> in place of the current method of sending a self-IPI. This ensures
>>> that pending VMExit injection to L1 occurs prior to executing any
>>> instructions in the guest (regardless of nesting level).
>>>
>>> When deferring VMExit injection, KVM generates an immediate VMExit
>>> from the (possibly nested) guest by sending itself an IPI. Because
>>> hardware interrupts are blocked prior to VMEnter and are unblocked
>>> (in hardware) after VMEnter, this results in taking a VMExit(INTR)
>>> before any guest instruction is executed. But, as this approach
>>> relies on the IPI being received before VMEnter executes, it only
>>> works as intended when KVM is running as L0. Because there are no
>>> architectural guarantees regarding when IPIs are delivered, when
>>> running nested the INTR may "arrive" long after L2 is running, e.g.
>>> L0 KVM doesn't force an immediate switch to L1 to deliver an INTR.
>>>
>>> For the most part, this unintended delay is not an issue since the
>>> events being injected to L1 also do not have architectural guarantees
>>> regarding their timing. The notable exception is the VMX preemption
>>> timer[1], which is architecturally guaranteed to cause a VMExit prior
>>> to executing any instructions in the guest if the timer value is '0'
>>> at VMEnter. Specifically, the delay in injecting the VMExit causes
>>> the preemption timer KVM unit test to fail when run in a nested guest.
>>>
>>> Note: this approach is viable even on CPUs with a broken preemption
>>> timer, as broken in this context only means the timer counts at the
>>> wrong rate. There are no known errata affecting a timer value of '0'.
>>>
>>> [1] I/O SMIs also have guarantees on when they arrive, but I have
>>> no idea if/how those are emulated in KVM.
>>>
>>> Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
>>>
>> This commit made me wonder why exactly L0 delivers the IPI long after L2 is already running.
>>
>> The IPI is sent by L1 writing to the LAPIC ICR, which will exit to L0 on APIC_WRITE,
>> which will eventually reach __apic_accept_irq(). Depending on whether APICv is active, that will either
>> set the vector in L1's LAPIC IRR and set KVM_REQ_EVENT,
>> or set the vector in vmx->pi_desc->pir and set ON in vmx->pi_desc->control.
>>
>> On the next vcpu_enter_guest(), one of the following will happen:
>> 1. If APICv is disabled, KVM_REQ_EVENT will be consumed, but inject_pending_event() will notice that vmx_interrupt_allowed() returns false.
>> Thus, vcpu_enter_guest() will call enable_irq_window(), which will set VIRTUAL_INTR_PENDING in vmcs01.
>> 2. If APICv is enabled, vcpu_enter_guest() will call vmx_sync_pir_to_irr(), which will move the vector from the PIR to L1's LAPIC IRR and update RVI accordingly.
>> In this case, the CPU will keep the vector pending in RVI because interrupts are disabled in L1.
>>
>> Later, when L1 does VMRESUME into L2, L0 will reach nested_vmx_run(), which will switch the active vmcs to vmcs02.
>> Therefore:
>> a. If APICv is disabled, KVM_REQ_EVENT is currently not set, and as vmcs01 is no longer active,
>> the CPU is not aware of the VIRTUAL_INTR_PENDING that was previously set. Thus, KVM will indeed enter
>> L2 without exiting to L1 on the pending IPI. *THIS SEEMS LIKE A BUG TO ME*, regardless of this commit.
>> b. If APICv is enabled, vmx_sync_pir_to_irr() will see that ON is not set in vmx->pi_desc->control and that
>> is_guest_mode()==true, and therefore vmx_hwapic_irr_update() does nothing as well.
>> Again, we reach a scenario where, even though L1 has a pending interrupt, it isn't evaluated.
>> *THIS SEEMS LIKE A BUG TO ME*, regardless of this commit.
>>
>> Therefore, even though this commit is correct from an architectural perspective, it works around
>> some real nVMX event-injection bugs that should be solved.
>> What do you think? Did I miss something in my analysis?
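
To make the non-APICv half of the analysis above concrete, the delivery path
I'm describing is roughly the following. This is a simplified paraphrase of
the lapic.c code as I read it, not the exact source, and the wrapper function
is only for illustration:

	/* Simplified sketch of how a fixed-vector IPI from L1 gets recorded. */
	static void deliver_fixed_ipi(struct kvm_vcpu *vcpu, int vector)
	{
		struct kvm_lapic *apic = vcpu->arch.apic;

		if (vcpu->arch.apicv_active) {
			/* APICv: post the vector into the PIR and set ON. */
			kvm_x86_ops->deliver_posted_interrupt(vcpu, vector);
		} else {
			/* No APICv: set the IRR bit and ask vcpu_enter_guest()
			 * to re-evaluate pending events. */
			kvm_lapic_set_irr(vector, apic);
			kvm_make_request(KVM_REQ_EVENT, vcpu);
			kvm_vcpu_kick(vcpu);
		}
	}

Either way, the pending vector ends up recorded on the L1 side
(VIRTUAL_INTR_PENDING in vmcs01, or L1's IRR/RVI), which is why nothing
re-evaluates it once nested_vmx_run() switches to vmcs02.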
> Hmm, IMO it's the other way around. Relying on an IPI to trigger an
> immediate VMExit is a bug, and setting KVM_REQ_EVENT is a workaround.
> There are simply no guarantees regarding the delivery time of IPIs.
>
> For example, setting KVM_REQ_EVENT at the end of nested_vmx_run()
> allows the preemption timer unit test to pass in L1, but it continues
> to fail in L2 (even if L0, L1 and L2 are all patched). At the time of
> VMEnter to L4 (from L0), L2 has done smp_send_reschedule() to self-IPI
> so that it can inject a VMExit to L3. But, L1 doesn't inject an event
> to L2 because its nested_run_pending is set and so there is no pending
> event visible to L0. Setting KVM_REQ_EVENT in L1 doesn't resolve the
> issue because L1 can't process that request until it gets control back,
> which doesn't happen until L4 takes a VMExit.

I'm not sure I was able to follow your scenario. If the vCPU is currently
running L2, then L1 by definition has nested_run_pending=false, as it is set
to true only when L2 enters L3 or L4 (via vmresume/vmlaunch).

When L2 triggers this self-IPI, it will exit to L0, which will reflect the
exit to L1. At that point, even if L1's nested_run_pending was set to true,
it will now be cleared, and L1 will queue a pending interrupt for L2. But
because L2 is now running with interrupts disabled, L1 won't inject an
interrupt; instead it will request an irq window in vmcs12.

When L2 later does vmresume back into L4 (after requesting an immediate exit
via the self-IPI), the vmresume will exit to L0, which will reflect the exit
to L1, which will set KVM_REQ_EVENT (as I suggested in my previous email).
Therefore, before L1 does vmresume with vmcs14, it will see that L2 has a
pending interrupt but nested_run_pending=true, and it will therefore send a
self-IPI. This will cause L0 to queue a pending interrupt for L1 and request
an irq window in vmcs01. Then, when L1 does vmresume with vmcs14, L0 will set
KVM_REQ_EVENT (as my patch suggests), which will make L0 request an immediate
exit after the vmresume with vmcs04. Therefore, before the CPU executes any
instruction of L4, it will immediately exit to L0, which will exit on
EXTERNAL_INTERRUPT to L1, which will in turn exit on EXTERNAL_INTERRUPT to L2
as required.

> On the other hand, using the preemption timer to trigger a VMExit
> ensures L0 will see the pending event (for lack of a better term)
> at the time of VMEnter to L4 because the event is visible in both
> vmcs24 and vmcs14. Using this approach, the unit test passes in
> L1 and L2 if both are patched, even if L0 is unpatched (since the
> current self-IPI mechanism works when running on bare metal).
>
> So, setting KVM_REQ_EVENT won't cause problems, but it also doesn't
> truly fix the issue. I can tack on a patch to set KVM_REQ_EVENT in
> nested_vmx_run() if the VMX preemption timer isn't supported. That
> would partially work around the issue on older hardware.
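
For concreteness, the change I have in mind really is just a one-liner; a
rough sketch, with the exact placement inside nested_vmx_run() (and any guard
on preemption-timer support) left open:

	/*
	 * Sketch: at the end of nested_vmx_run(), once vmcs02 is active,
	 * force the next vcpu_enter_guest() iteration to re-run
	 * inject_pending_event() / enable_irq_window() against vmcs02,
	 * instead of waiting for something else to set KVM_REQ_EVENT.
	 */
	kvm_make_request(KVM_REQ_EVENT, vcpu);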
I believe the patch I suggested, setting KVM_REQ_EVENT at nested_vmx_run(),
fixes an issue which is orthogonal to your patch.

It fixes the issue that if there is a pending interrupt for L1 while L1 is
running with interrupts disabled and L1 executes vmlaunch/vmresume, L0 should
re-evaluate whether it should exit from L2 to L1 because of the pending L1
interrupt. This is just a general issue, and the L1 interrupt shouldn't be
delayed until the next point at which KVM happens to set KVM_REQ_EVENT.

If you agree with the above analysis, I will submit such a patch, unrelated
to your patch series.

-Liran
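
P.S. Just to be sure we are referring to the same mechanism: as I understand
your patch, forcing the immediate VMExit boils down to arming the VMX
preemption timer with a value of '0' before VMEnter, along the lines of the
sketch below (the helper name is illustrative and the actual wiring in your
patch is of course more involved):

	static void arm_immediate_exit(void)
	{
		/* Activate the VMX preemption timer and make it fire at
		 * once: a value of '0' guarantees a VMExit before any guest
		 * instruction is executed. */
		vmcs_set_bits(PIN_BASED_VM_EXEC_CONTROL,
			      PIN_BASED_VMX_PREEMPTION_TIMER);
		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, 0);
	}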