> On 28 Aug 2018, at 17:46, Sean Christopherson <sean.j.christopherson@xxxxxxxxx> wrote:
>
> On Tue, 2018-08-28 at 02:38 +0300, Liran Alon wrote:
>>
>>> On 28 Aug 2018, at 1:21, Sean Christopherson <sean.j.christopherson@xxxxxxxxx> wrote:
>>>
>>> A VMX preemption timer value of '0' is guaranteed to cause a VMExit
>>> prior to the CPU executing any instructions in the guest. Use the
>>> preemption timer (if it's supported) to trigger an immediate VMExit
>>> in place of the current method of sending a self-IPI. This ensures
>>> that pending VMExit injection to L1 occurs prior to executing any
>>> instructions in the guest (regardless of nesting level).
>>>
>>> When deferring VMExit injection, KVM generates an immediate VMExit
>>> from the (possibly nested) guest by sending itself an IPI. Because
>>> hardware interrupts are blocked prior to VMEnter and are unblocked
>>> (in hardware) after VMEnter, this results in taking a VMExit(INTR)
>>> before any guest instruction is executed. But, as this approach
>>> relies on the IPI being received before VMEnter executes, it only
>>> works as intended when KVM is running as L0. Because there are no
>>> architectural guarantees regarding when IPIs are delivered, when
>>> running nested the INTR may "arrive" long after L2 is running, e.g.
>>> L0 KVM doesn't force an immediate switch to L1 to deliver an INTR.
>>>
>>> For the most part, this unintended delay is not an issue since the
>>> events being injected to L1 also do not have architectural guarantees
>>> regarding their timing. The notable exception is the VMX preemption
>>> timer[1], which is architecturally guaranteed to cause a VMExit prior
>>> to executing any instructions in the guest if the timer value is '0'
>>> at VMEnter. Specifically, the delay in injecting the VMExit causes
>>> the preemption timer KVM unit test to fail when run in a nested guest.
>>>
>>> Note: this approach is viable even on CPUs with a broken preemption
>>> timer, as broken in this context only means the timer counts at the
>>> wrong rate. There are no known errata affecting a timer value of '0'.
>>>
>>> [1] I/O SMIs also have guarantees on when they arrive, but I have
>>> no idea if/how those are emulated in KVM.
>>>
>>> Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
>>>
>> This commit made me wonder why exactly L0 delivers the IPI long after L2 is already running.
>>
>> The IPI is sent by L1 writing to the LAPIC ICR, which will exit to L0 on APIC_WRITE,
>> which will eventually reach __apic_accept_irq(). Depending on whether APICv is active, that will either
>> set the vector in L1's LAPIC IRR and set KVM_REQ_EVENT,
>> or set the vector in vmx->pi_desc->pir and set ON in vmx->pi_desc->control.
>>
>> On the next vcpu_enter_guest(), one of the following will happen:
>> 1. If APICv is disabled, KVM_REQ_EVENT will be consumed, but inject_pending_event() will notice that vmx_interrupt_allowed() returns false.
>> Thus, vcpu_enter_guest() will call enable_irq_window(), which will set VIRTUAL_INTR_PENDING in vmcs01.
>> 2. If APICv is enabled, vcpu_enter_guest() will call vmx_sync_pir_to_irr(), which will move the vector from the PIR to L1's LAPIC IRR and update RVI accordingly.
>> In this case, the CPU will keep the vector pending in RVI because interrupts are disabled in L1.
>>
>> Later, when L1 does VMRESUME into L2, L0 will reach nested_vmx_run(), which will switch the active vmcs to vmcs02.
>> Therefore:
>> a. If APICv is disabled, KVM_REQ_EVENT is currently not set, and as vmcs01 is no longer active,
>> the CPU is not aware of the VIRTUAL_INTR_PENDING that was previously set. Thus, KVM will indeed enter
>> L2 without exiting to L1 on the pending IPI. *THIS SEEMS LIKE A BUG TO ME*, regardless of this commit.
>> b. If APICv is enabled, vmx_sync_pir_to_irr() will see that ON is not set in vmx->pi_desc->control and that
>> is_guest_mode()==true, and therefore vmx_hwapic_irr_update() does nothing as well.
>> Again, we reach a scenario where, even though L1 has a pending interrupt, it isn't evaluated.
>> *THIS SEEMS LIKE A BUG TO ME*, regardless of this commit.
>>
>> Therefore, even though this commit is correct from an architectural perspective, it works around
>> some real nVMX event-injection bugs that should be solved.
>> What do you think? Did I miss something in my analysis?
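
To make the non-APICv half of the analysis above concrete, the delivery path
I'm describing is roughly the following. This is a simplified paraphrase of
the lapic.c code as I read it, not the exact source, and the wrapper function
is only for illustration:

	/* Simplified sketch of how a fixed-vector IPI from L1 gets recorded. */
	static void deliver_fixed_ipi(struct kvm_vcpu *vcpu, int vector)
	{
		struct kvm_lapic *apic = vcpu->arch.apic;

		if (vcpu->arch.apicv_active) {
			/* APICv: post the vector into the PIR and set ON. */
			kvm_x86_ops->deliver_posted_interrupt(vcpu, vector);
		} else {
			/* No APICv: set the IRR bit and ask vcpu_enter_guest()
			 * to re-evaluate pending events. */
			kvm_lapic_set_irr(vector, apic);
			kvm_make_request(KVM_REQ_EVENT, vcpu);
			kvm_vcpu_kick(vcpu);
		}
	}

Either way, the pending vector ends up recorded on the L1 side
(VIRTUAL_INTR_PENDING in vmcs01, or L1's IRR/RVI), which is why nothing
re-evaluates it once nested_vmx_run() switches to vmcs02.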
> Hmm, IMO it's the other way around. Relying on an IPI to trigger an
> immediate VMExit is a bug, and setting KVM_REQ_EVENT is a workaround.
> There are simply no guarantees regarding the delivery time of IPIs.
>
> For example, setting KVM_REQ_EVENT at the end of nested_vmx_run()
> allows the preemption timer unit test to pass in L1, but it continues
> to fail in L2 (even if L0, L1 and L2 are all patched). At the time of
> VMEnter to L4 (from L0), L2 has done smp_send_reschedule() to self-IPI
> so that it can inject a VMExit to L3. But, L1 doesn't inject an event
> to L2 because its nested_run_pending is set and so there is no pending
> event visible to L0. Setting KVM_REQ_EVENT in L1 doesn't resolve the
> issue because L1 can't process that request until it gets control back,
> which doesn't happen until L4 takes a VMExit.

I'm not sure I was able to follow your scenario. If the vCPU is currently
running L2, then L1 by definition has nested_run_pending=false, as it is set
to true only when L2 enters L3 or L4 (via vmresume/vmlaunch).

When L2 triggers this self-IPI, it will exit to L0, which will reflect the
exit to L1. At that point, even if L1's nested_run_pending was set to true,
it will now be cleared, and L1 will queue a pending interrupt for L2. But
because L2 is now running with interrupts disabled, L1 won't inject an
interrupt; instead it will request an irq window in vmcs12.

When L2 later does vmresume back into L4 (after requesting an immediate exit
via the self-IPI), the vmresume will exit to L0, which will reflect the exit
to L1, which will set KVM_REQ_EVENT (as I suggested in my previous email).
Therefore, before L1 does vmresume with vmcs14, it will see that L2 has a
pending interrupt but nested_run_pending=true, and it will therefore send a
self-IPI. This will cause L0 to queue a pending interrupt for L1 and request
an irq window in vmcs01. Then, when L1 does vmresume with vmcs14, L0 will set
KVM_REQ_EVENT (as my patch suggests), which will make L0 request an immediate
exit after the vmresume with vmcs04. Therefore, before the CPU executes any
instruction of L4, it will immediately exit to L0, which will exit on
EXTERNAL_INTERRUPT to L1, which will in turn exit on EXTERNAL_INTERRUPT to L2
as required.

> On the other hand, using the preemption timer to trigger a VMExit
> ensures L0 will see the pending event (for lack of a better term)
> at the time of VMEnter to L4 because the event is visible in both
> vmcs24 and vmcs14. Using this approach, the unit test passes in
> L1 and L2 if both are patched, even if L0 is unpatched (since the
> current self-IPI mechanism works when running on bare metal).
>
> So, setting KVM_REQ_EVENT won't cause problems, but it also doesn't
> truly fix the issue. I can tack on a patch to set KVM_REQ_EVENT in
> nested_vmx_run() if the VMX preemption timer isn't supported. That
> would partially work around the issue on older hardware.
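
For concreteness, the change I have in mind really is just a one-liner; a
rough sketch, with the exact placement inside nested_vmx_run() (and any guard
on preemption-timer support) left open:

	/*
	 * Sketch: at the end of nested_vmx_run(), once vmcs02 is active,
	 * force the next vcpu_enter_guest() iteration to re-run
	 * inject_pending_event() / enable_irq_window() against vmcs02,
	 * instead of waiting for something else to set KVM_REQ_EVENT.
	 */
	kvm_make_request(KVM_REQ_EVENT, vcpu);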
I believe the patch I suggested, setting KVM_REQ_EVENT at nested_vmx_run(),
fixes an issue which is orthogonal to your patch.

It fixes the issue that if there is a pending interrupt for L1 while L1 is
running with interrupts disabled and L1 executes vmlaunch/vmresume, L0 should
re-evaluate whether it should exit from L2 to L1 because of the pending L1
interrupt. This is just a general issue, and the L1 interrupt shouldn't be
delayed until the next point at which KVM happens to set KVM_REQ_EVENT.

If you agree with the above analysis, I will submit such a patch, unrelated
to your patch series.

-Liran
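
P.S. Just to be sure we are referring to the same mechanism: as I understand
your patch, forcing the immediate VMExit boils down to arming the VMX
preemption timer with a value of '0' before VMEnter, along the lines of the
sketch below (the helper name is illustrative and the actual wiring in your
patch is of course more involved):

	static void arm_immediate_exit(void)
	{
		/* Activate the VMX preemption timer and make it fire at
		 * once: a value of '0' guarantees a VMExit before any guest
		 * instruction is executed. */
		vmcs_set_bits(PIN_BASED_VM_EXEC_CONTROL,
			      PIN_BASED_VMX_PREEMPTION_TIMER);
		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, 0);
	}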