On 27/12/17 14:55, Paolo Bonzini wrote:
On 27/12/2017 11:51, Liran Alon wrote:
Maybe I misunderstand you, but I don't think this is true.
complete_nested_posted_interrupt() relies on being called at a very
specific call-site: right before VMEntry, with interrupts disabled. It
works by issuing a self-IPI to cause the CPU to actually process the
nested posted-interrupts.
In addition, I don't see how we can utilize the fact that
kvm_cpu_has_interrupt() calls apic_has_interrupt_for_ppr(), as
vmx->nested.pi_desc->pir should never be synced into the L0 vLAPIC IRR,
in contrast to vmx->pi_desc->pir, which is synced by sync_pir_to_irr().
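To make that constraint concrete, here is a rough sketch of what the
completion path looks like (illustrative only and simplified; the actual
patch may differ in the details):

/*
 * Illustrative sketch only.  Must run with interrupts disabled, right
 * before VMEntry, so the self-IPI is serviced once the CPU is in
 * non-root mode and the CPU itself processes the L2 PI descriptor.
 */
static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
{
        struct vcpu_vmx *vmx = to_vmx(vcpu);

        /* Nothing to do unless L1 turned the descriptor ON. */
        if (!vmx->nested.pi_desc || !pi_test_on(vmx->nested.pi_desc))
                return;

        /* Self-IPI on the nested PI vector: the CPU performs the actual
         * PIR-to-vIRR processing once it is back in non-root mode. */
        apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
                            POSTED_INTR_NESTED_VECTOR);
}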
You're right, the callbacks must stay. What confused me is the
reference to pi_test_on in vmx_cpu_has_nested_posted_interrupt. While
your patches are an improvement, there is still some confusion on what
leads to injecting a nested posted interrupt, and on the split between
"deliver" and "complete":
- checking pi_test_on should only be done when consuming the L2 PI
descriptor (which in your case is done by the processor via the
self-IPI). So you don't need to test it before sending the self-IPI.
I agree we don't strictly have to check it. It is done as an optimization,
to avoid sending a self-IPI when it won't do anything because the ON bit
in vmx->nested.pi_desc->control is clear.
This optimization is even more important when this code itself runs in L1
(think triple-virtualization ^_^), since it avoids a #VMExit on the write
to the LAPIC ICR.
- vmx->nested.pi_pending should not be set in
vmx_deliver_nested_posted_interrupt after patch 9. pi_pending should
only be set by the nested PI handler, when it is invoked outside guest
mode. It is then consumed before doing the self-IPI as in the above sketch.
I disagree.
vmx->nested.pi_pending is used as a signal that L1 has indeed sent a
vmcs12->posted_intr_nv IPI. If vmx_deliver_nested_posted_interrupt()
sees that the target vCPU thread is OUTSIDE_GUEST_MODE, it won't send a
physical IPI but will instead only kick the vCPU. In that case, the only
way the target vCPU thread knows it should trigger the self-IPI is that
pi_pending was set by vmx_deliver_nested_posted_interrupt()
(since in this case the nested PI handler in the host was never invoked).
(It cannot rely only on the ON bit in vmx->nested.pi_desc->control, as
that may be set by L1 without L1 actually triggering a
vmcs12->posted_intr_nv IPI.)
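Roughly, the delivery-side logic I'm referring to (a simplified sketch,
not the exact patch code):

static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
                                               int vector)
{
        struct vcpu_vmx *vmx = to_vmx(vcpu);

        if (!is_guest_mode(vcpu) || vector != vmx->nested.posted_intr_nv)
                return -1;

        /* Remember that L1 really sent a posted_intr_nv IPI: this is the
         * only signal left if the target is not running in guest mode
         * and we merely kick it below. */
        vmx->nested.pi_pending = true;

        if (vcpu->mode == IN_GUEST_MODE) {
                /* Target is executing L2 right now: a physical IPI on the
                 * nested PI vector makes the CPU process the descriptor
                 * directly. */
                apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
                                    POSTED_INTR_NESTED_VECTOR);
        } else {
                /* OUTSIDE_GUEST_MODE: no physical IPI; just kick so the
                 * vCPU completes the posted interrupt before VMEntry. */
                kvm_make_request(KVM_REQ_EVENT, vcpu);
                kvm_vcpu_kick(vcpu);
        }
        return 0;
}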
- likewise, vmx_cpu_has_nested_posted_interrupt should only check
vmx->nested.pi_pending. I don't think it should even check
is_guest_mode(vcpu) or vmx->nested.pi_desc; compare with the handling of
KVM_REQ_NMI, which doesn't test nmi_allowed. This leads me to the next point.
There is an interesting question about what should happen when
(is_guest_mode(vcpu) && vmx->nested.pi_pending &&
!pi_test_on(vmx->nested.pi_desc)). I would expect the L2 guest not to be
woken from its HLT, as virtual interrupt delivery should not see any
pending interrupt to inject. But I would expect the same when the PIR is
empty (even if the ON bit is set), which this condition doesn't check
either...
So I think I agree, but we may have a small semantic issue here of waking
the L2 vCPU unnecessarily. This should be a very rare case, though, so
maybe handle it later (write a TODO comment).
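In code, what I think we are agreeing on would be roughly this
(illustrative sketch only):

static bool vmx_cpu_has_nested_posted_interrupt(struct kvm_vcpu *vcpu)
{
        struct vcpu_vmx *vmx = to_vmx(vcpu);

        /*
         * TODO: this may wake a halted L2 vCPU even when the PIR is
         * empty (or ON is clear), in which case virtual interrupt
         * delivery will find nothing to inject.  Rare enough to handle
         * later.
         */
        return vmx->nested.pi_pending;
}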
- vmx_complete_nested_posted_interrupt *must* find a valid descriptor,
even after a vmexit, because vmx_complete_nested_posted_interrupt is the
very first thing that runs and L1 has had no time to run a vmwrite.
I agree. It is better to WARN in case the descriptor is not valid.
So the above could become:
if (is_guest_mode(vcpu) && vmx->nested.pi_pending) {
        vmx->nested.pi_pending = false;
        apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
                            POSTED_INTR_NESTED_VECTOR);
}
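With the WARN suggested above, the completion path would then start with
something along these lines (illustrative only):

        /* The vmcs12 PI descriptor must still be mapped here, since
         * this runs before L1 has had any chance to vmwrite. */
        if (WARN_ON_ONCE(!vmx->nested.pi_desc))
                return;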
and then what about that is_guest_mode check? When L1 sent the PI
notification, it thought that the destination VCPU was in guest mode.
Upon vmexit, L1 can expect to either see ON=0 or get the notification
interrupt itself. So either we deliver the interrupt to L1, or we must
do L2 virtual interrupt delivery before L1 even runs. Hence IMO the
is_guest_mode check is wrong, which unfortunately throws a wrench into
the self-IPI idea.
I agree with moving vmx_complete_nested_posted_interrupt out of
check_nested_events, and also with the simplification of
vmx_hwapic_irr_update. However, in my opinion the self-IPI is actually
complicating things. Once the handling of vmx->nested.pi_pending is
cleaned up (the nested PI interrupt handler in patch 9 is a great
improvement in that respect), we can move
vmx_complete_nested_posted_interrupt to a new KVM_REQ_NESTED_PI request
that replaces vmx->nested.pi_pending. The GUEST_INTR_STATUS write can
be done in the vmcs12, and will then be copied by prepare_vmcs02 to the
vmcs02.
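As a rough illustration of that direction (hypothetical sketch;
KVM_REQ_NESTED_PI does not exist yet and the exact plumbing would
certainly differ):

        /* Delivery side: instead of setting vmx->nested.pi_pending,
         * raise a request and kick the target vCPU. */
        kvm_make_request(KVM_REQ_NESTED_PI, vcpu);
        kvm_vcpu_kick(vcpu);

        /* In vcpu_enter_guest(): consume the request right before
         * VMEntry. */
        if (kvm_check_request(KVM_REQ_NESTED_PI, vcpu))
                vmx_complete_nested_posted_interrupt(vcpu);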
I may be wrong, but based on my experience cleaning up the bare-metal
APICv, I suspect that both the current code and this version are overly
complicated, and there is a nice and simple core waiting to be
extracted. Let's fix the race condition in the simple way, and
separately focus on cleaning up pi_pending and everything that follows
from there.
Interesting.
I think I now follow what you mean regarding cleaning up the logic around
pi_pending. This is how I understand it:
1. "vmx->nested.pi_pending" is a flag used to indicate "L1 has sent a
vmcs12->posted_intr_nv IPI". That's it.
2. The current code is a bit weird in the sense that instead of signaling
the pending IPI in the virtual LAPIC IRR, we set it in a special variable.
If we had set it in the virtual LAPIC IRR, we could in theory behave very
similarly to a standard CPU. At the interrupt injection point, we could
(see the rough sketch at the end of this mail):
(a) If the vCPU is in root-mode: just inject the pending interrupt normally.
(b) If the vCPU is in non-root-mode and the posted-interrupts feature is
active, then instead of injecting the pending interrupt, simulate
processing of posted-interrupts.
3. The processing of the nested posted-interrupts itself can still be
done via the self-IPI mechanism.
4. Because (2) is not done, there is currently still an issue that L1
doesn't receive a vmcs12->posted_intr_nv interrupt when the target vCPU
thread has exited from L2 to L1 with pi_pending=true.
Do we agree on the above? Or am I still misunderstanding something?
If we agree, I will attempt to clean up the code to handle it as described above.
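For concreteness, the injection-point split in (2) could look roughly
like this (purely illustrative; nested_pi_active() is a made-up helper
name):

        if (!is_guest_mode(vcpu)) {
                /* (a) Root mode: the posted_intr_nv vector sits in the
                 * virtual LAPIC IRR and is injected into L1 normally,
                 * like any other pending interrupt. */
                return;
        }

        if (nested_pi_active(vcpu)) {
                /* (b) Non-root mode with vmcs12 posted interrupts
                 * active: don't inject; simulate posted-interrupt
                 * processing instead, e.g. via the nested-PI self-IPI. */
                apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
                                    POSTED_INTR_NESTED_VECTOR);
        }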
Regards,
-Liran