> On 28 Aug 2018, at 2:38, Liran Alon <liran.alon@xxxxxxxxxx> wrote:
>
>
>> On 28 Aug 2018, at 1:21, Sean Christopherson <sean.j.christopherson@xxxxxxxxx> wrote:
>>
>> A VMX preemption timer value of '0' is guaranteed to cause a VMExit
>> prior to the CPU executing any instructions in the guest. Use the
>> preemption timer (if it's supported) to trigger immediate VMExit
>> in place of the current method of sending a self-IPI. This ensures
>> that pending VMExit injection to L1 occurs prior to executing any
>> instructions in the guest (regardless of nesting level).
>>
>> When deferring VMExit injection, KVM generates an immediate VMExit
>> from the (possibly nested) guest by sending itself an IPI. Because
>> hardware interrupts are blocked prior to VMEnter and are unblocked
>> (in hardware) after VMEnter, this results in taking a VMExit(INTR)
>> before any guest instruction is executed. But, as this approach
>> relies on the IPI being received before VMEnter executes, it only
>> works as intended when KVM is running as L0. Because there are no
>> architectural guarantees regarding when IPIs are delivered, when
>> running nested the INTR may "arrive" long after L2 is running, e.g.
>> L0 KVM doesn't force an immediate switch to L1 to deliver an INTR.
>>
>> For the most part, this unintended delay is not an issue, since the
>> events being injected to L1 also do not have architectural guarantees
>> regarding their timing. The notable exception is the VMX preemption
>> timer[1], which is architecturally guaranteed to cause a VMExit prior
>> to executing any instructions in the guest if the timer value is '0'
>> at VMEnter. Specifically, the delay in injecting the VMExit causes
>> the preemption timer KVM unit test to fail when run in a nested guest.
>>
>> Note: this approach is viable even on CPUs with a broken preemption
>> timer, as broken in this context only means the timer counts at the
>> wrong rate. There are no known errata affecting a timer value of '0'.
>>
>> [1] I/O SMIs also have guarantees on when they arrive, but I have
>> no idea if/how those are emulated in KVM.
>>
>> Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
>>
>
> This commit made me wonder why exactly L0 delivers the IPI long after
> L2 is already running.
>
> The IPI is sent by L1 writing to the LAPIC ICR, which exits to L0 on
> APIC_WRITE and eventually reaches __apic_accept_irq(). Depending on
> whether APICv is active, that will either set the vector in L1's LAPIC
> IRR and set KVM_REQ_EVENT, or set the vector in vmx->pi_desc->pir and
> set ON in vmx->pi_desc->control.
>
> On the next vcpu_enter_guest(), one of the following happens:
> 1. If APICv is disabled, KVM_REQ_EVENT is consumed but
>    inject_pending_event() notices that vmx_interrupt_allowed() returns
>    false. Thus, vcpu_enter_guest() calls enable_irq_window(), which
>    sets VIRTUAL_INTR_PENDING in vmcs01.
> 2. If APICv is enabled, vcpu_enter_guest() calls vmx_sync_pir_to_irr(),
>    which moves the vector from the PIR to L1's LAPIC IRR and updates
>    RVI accordingly. In this case, the CPU keeps the vector pending in
>    RVI, as interrupts are disabled in L1.
>
> Later, when L1 executes VMRESUME into L2, L0 reaches nested_vmx_run(),
> which switches the active vmcs to vmcs02. Therefore:
> a. If APICv is disabled, KVM_REQ_EVENT is currently not set, and as
>    vmcs01 is not active, the CPU is not aware of the
>    VIRTUAL_INTR_PENDING that was previously set. Thus, KVM will indeed
>    enter L2 without exiting to L1 on the pending IPI.
>    *THIS SEEMS LIKE A BUG TO ME*. Regardless of this commit.
> b. If APICv is enabled, vmx_sync_pir_to_irr() sees that ON is not set
>    in vmx->pi_desc->control and is_guest_mode()==true, and therefore
>    vmx_hwapic_irr_update() does nothing as well. Again, we reach a
>    scenario where L1 has a pending interrupt but it isn't evaluated.
>    *THIS SEEMS LIKE A BUG TO ME*. Regardless of this commit.
>
> Therefore, even though this commit is correct from an architectural
> perspective, it works around some real nVMX event-injection bugs that
> should be solved.
> What do you think? Did I miss something in my analysis?
>
> -Liran

I will also add that I think the above bugs can be fixed with a simple
patch: we could just add kvm_make_request(KVM_REQ_EVENT, vcpu); on a
successful run of nested_vmx_run().
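Untested, and I'm quoting the surrounding code from memory, so take
this as a sketch of the idea rather than a real diff against
nested_vmx_run():

	/* ...after the switch to vmcs02 has succeeded: */
	vmx->nested.nested_run_pending = 1;

	/*
	 * Re-evaluate pending L1 events against vmcs02. Without this, an
	 * interrupt latched while vmcs01 was active (VIRTUAL_INTR_PENDING
	 * in vmcs01, or a vector left pending in RVI) is not noticed
	 * until something else happens to set KVM_REQ_EVENT.
	 */
	kvm_make_request(KVM_REQ_EVENT, vcpu);

	return 1;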
This will cause vcpu_enter_guest() -> inject_pending_event() ->
vmx_check_nested_events() to see that there is a pending interrupt for
L1 and return -EBUSY (because nested_run_pending==true), which will
force an immediate exit that sets KVM_REQ_EVENT again. On the next
round of vcpu_enter_guest(), vmx_check_nested_events() will then VMExit
from L2 to L1 on EXTERNAL_INTERRUPT as required.

If you agree, I will submit such a patch.

-Liran
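P.S. For anyone reading along, my mental model of the preemption-timer
trick in this patch is roughly the following. This is only a sketch of
the idea (the actual patch plumbs this through vmx-specific code, and
req_immediate_exit here refers to the existing local in
vcpu_enter_guest()), not the patch itself:

	if (req_immediate_exit) {
		if (cpu_has_vmx_preemption_timer()) {
			/*
			 * Arm the VMX preemption timer with a value of 0:
			 * architecturally guaranteed to cause a VMExit
			 * before the guest executes any instruction,
			 * regardless of nesting level.
			 */
			vmcs_set_bits(PIN_BASED_VM_EXEC_CONTROL,
				      PIN_BASED_VMX_PREEMPTION_TIMER);
			vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, 0);
		} else {
			/*
			 * Old method: self-IPI. Works when KVM is L0, but
			 * when running nested the INTR may arrive only
			 * after L2 has started executing.
			 */
			smp_send_reschedule(vcpu->cpu);
		}
	}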