> On 28 Aug 2018, at 2:38, Liran Alon <liran.alon@xxxxxxxxxx> wrote:
>
>
>> On 28 Aug 2018, at 1:21, Sean Christopherson <sean.j.christopherson@xxxxxxxxx> wrote:
>>
>> A VMX preemption timer value of '0' is guaranteed to cause a VMExit
>> prior to the CPU executing any instructions in the guest. Use the
>> preemption timer (if it's supported) to trigger immediate VMExit
>> in place of the current method of sending a self-IPI. This ensures
>> that pending VMExit injection to L1 occurs prior to executing any
>> instructions in the guest (regardless of nesting level).
>>
>> When deferring VMExit injection, KVM generates an immediate VMExit
>> from the (possibly nested) guest by sending itself an IPI. Because
>> hardware interrupts are blocked prior to VMEnter and are unblocked
>> (in hardware) after VMEnter, this results in taking a VMExit(INTR)
>> before any guest instruction is executed. But, as this approach
>> relies on the IPI being received before VMEnter executes, it only
>> works as intended when KVM is running as L0. Because there are no
>> architectural guarantees regarding when IPIs are delivered, when
>> running nested the INTR may "arrive" long after L2 is running, e.g.
>> L0 KVM doesn't force an immediate switch to L1 to deliver an INTR.
>>
>> For the most part, this unintended delay is not an issue, since the
>> events being injected to L1 also do not have architectural guarantees
>> regarding their timing. The notable exception is the VMX preemption
>> timer[1], which is architecturally guaranteed to cause a VMExit prior
>> to executing any instructions in the guest if the timer value is '0'
>> at VMEnter. Specifically, the delay in injecting the VMExit causes
>> the preemption timer KVM unit test to fail when run in a nested guest.
>>
>> Note: this approach is viable even on CPUs with a broken preemption
>> timer, as broken in this context only means the timer counts at the
>> wrong rate. There are no known errata affecting a timer value of '0'.
>>
>> [1] I/O SMIs also have guarantees on when they arrive, but I have
>> no idea if/how those are emulated in KVM.
>>
>> Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
>>
>
> This commit made me wonder why exactly L0 delivers the IPI long after
> L2 is already running.
>
> The IPI is sent by L1 writing to the LAPIC ICR, which exits to L0 on
> APIC_WRITE and eventually reaches __apic_accept_irq(). Depending on
> whether APICv is active, that will either set the vector in L1's LAPIC
> IRR and set KVM_REQ_EVENT, or set the vector in vmx->pi_desc->pir and
> set ON in vmx->pi_desc->control.
>
> On the next vcpu_enter_guest(), one of the following happens:
> 1. If APICv is disabled, KVM_REQ_EVENT is consumed but
>    inject_pending_event() notices that vmx_interrupt_allowed() returns
>    false. Thus, vcpu_enter_guest() calls enable_irq_window(), which
>    sets VIRTUAL_INTR_PENDING in vmcs01.
> 2. If APICv is enabled, vcpu_enter_guest() calls vmx_sync_pir_to_irr(),
>    which moves the vector from the PIR to L1's LAPIC IRR and updates
>    RVI accordingly. In this case, the CPU keeps the vector pending in
>    RVI, as interrupts are disabled in L1.
>
> Later, when L1 executes VMRESUME into L2, L0 reaches nested_vmx_run(),
> which switches the active vmcs to vmcs02. Therefore:
> a. If APICv is disabled, KVM_REQ_EVENT is currently not set, and as
>    vmcs01 is not active, the CPU is not aware of the
>    VIRTUAL_INTR_PENDING that was previously set. Thus, KVM will indeed
>    enter L2 without exiting to L1 on the pending IPI.
>    *THIS SEEMS LIKE A BUG TO ME*. Regardless of this commit.
> b. If APICv is enabled, vmx_sync_pir_to_irr() sees that ON is not set
>    in vmx->pi_desc->control and is_guest_mode()==true, and therefore
>    vmx_hwapic_irr_update() does nothing as well. Again, we reach a
>    scenario where L1 has a pending interrupt but it isn't evaluated.
>    *THIS SEEMS LIKE A BUG TO ME*. Regardless of this commit.
>
> Therefore, even though this commit is correct from an architectural
> perspective, it works around some real nVMX event-injection bugs that
> should be solved.
> What do you think? Did I miss something in my analysis?
>
> -Liran

I will also add that I think the above bugs can be fixed with a simple
patch: we could just add kvm_make_request(KVM_REQ_EVENT, vcpu); on a
successful run of nested_vmx_run().
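Untested, and I'm quoting the surrounding code from memory, so take
this as a sketch of the idea rather than a real diff against
nested_vmx_run():

	/* ...after the switch to vmcs02 has succeeded: */
	vmx->nested.nested_run_pending = 1;

	/*
	 * Re-evaluate pending L1 events against vmcs02. Without this, an
	 * interrupt latched while vmcs01 was active (VIRTUAL_INTR_PENDING
	 * in vmcs01, or a vector left pending in RVI) is not noticed
	 * until something else happens to set KVM_REQ_EVENT.
	 */
	kvm_make_request(KVM_REQ_EVENT, vcpu);

	return 1;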
This will cause vcpu_enter_guest() -> inject_pending_event() ->
vmx_check_nested_events() to see that there is a pending interrupt for
L1 and return -EBUSY (because nested_run_pending==true), which will
force an immediate exit that sets KVM_REQ_EVENT again. On the next
round of vcpu_enter_guest(), vmx_check_nested_events() will then VMExit
from L2 to L1 on EXTERNAL_INTERRUPT as required.

If you agree, I will submit such a patch.

-Liran
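P.S. For anyone reading along, my mental model of the preemption-timer
trick in this patch is roughly the following. This is only a sketch of
the idea (the actual patch plumbs this through vmx-specific code, and
req_immediate_exit here refers to the existing local in
vcpu_enter_guest()), not the patch itself:

	if (req_immediate_exit) {
		if (cpu_has_vmx_preemption_timer()) {
			/*
			 * Arm the VMX preemption timer with a value of 0:
			 * architecturally guaranteed to cause a VMExit
			 * before the guest executes any instruction,
			 * regardless of nesting level.
			 */
			vmcs_set_bits(PIN_BASED_VM_EXEC_CONTROL,
				      PIN_BASED_VMX_PREEMPTION_TIMER);
			vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, 0);
		} else {
			/*
			 * Old method: self-IPI. Works when KVM is L0, but
			 * when running nested the INTR may arrive only
			 * after L2 has started executing.
			 */
			smp_send_reschedule(vcpu->cpu);
		}
	}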