On 27/12/17 14:55, Paolo Bonzini wrote:
On 27/12/2017 11:51, Liran Alon wrote:
Maybe I misunderstand you, but I don't think this is true.
complete_nested_posted_interrupt() relies on being called at a very
specific call-site: right before VMEntry, with interrupts disabled. It
works by issuing a self-IPI to cause the CPU to actually process the
nested posted-interrupts.
In addition, I don't see how we can utilize the fact that
kvm_cpu_has_interrupt() calls apic_has_interrupt_for_ppr(), as
vmx->nested.pi_desc->pir should never be synced into the L0 vLAPIC IRR,
in contrast to vmx->pi_desc->pir, which is synced by sync_pir_to_irr().
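To make that constraint concrete, here is a rough sketch of what the
completion path looks like (illustrative only and simplified; the actual
patch may differ in the details):

/*
 * Illustrative sketch only.  Must run with interrupts disabled, right
 * before VMEntry, so the self-IPI is serviced once the CPU is in
 * non-root mode and the CPU itself processes the L2 PI descriptor.
 */
static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
{
        struct vcpu_vmx *vmx = to_vmx(vcpu);

        /* Nothing to do unless L1 turned the descriptor ON. */
        if (!vmx->nested.pi_desc || !pi_test_on(vmx->nested.pi_desc))
                return;

        /* Self-IPI on the nested PI vector: the CPU performs the actual
         * PIR-to-vIRR processing once it is back in non-root mode. */
        apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
                            POSTED_INTR_NESTED_VECTOR);
}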
You're right, the callbacks must stay. What confused me is the
reference to pi_test_on in vmx_cpu_has_nested_posted_interrupt. While
your patches are an improvement, there is still some confusion on what
leads to injecting a nested posted interrupt, and on the split between
"deliver" and "complete":
- checking pi_test_on should only be done when consuming the L2 PI
descriptor (which in your case is done by the processor via the
self-IPI). So you don't need to test it before sending the self-IPI.
I agree we don't strictly have to check it. It is done as an optimization,
to avoid sending a self-IPI when it won't do anything because the ON bit
in vmx->nested.pi_desc->control is clear.
This optimization is even more important when this code itself runs in L1
(think triple-virtualization ^_^), since it avoids a #VMExit on the write
to the LAPIC ICR.
- vmx->nested.pi_pending should not be set in
vmx_deliver_nested_posted_interrupt after patch 9. pi_pending should
only be set by the nested PI handler, when it is invoked outside guest
mode. It is then consumed before doing the self-IPI as in the above sketch.
I disagree.
vmx->nested.pi_pending is used as a signal that L1 has indeed sent a
vmcs12->posted_intr_nv IPI. If vmx_deliver_nested_posted_interrupt()
sees that the target vCPU thread is OUTSIDE_GUEST_MODE, it won't send a
physical IPI but will instead only kick the vCPU. In that case, the only
way the target vCPU thread knows it should trigger the self-IPI is that
pi_pending was set by vmx_deliver_nested_posted_interrupt()
(since in this case the nested PI handler in the host was never invoked).
(It cannot rely only on the ON bit in vmx->nested.pi_desc->control, as
that may be set by L1 without L1 actually triggering a
vmcs12->posted_intr_nv IPI.)
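Roughly, the delivery-side logic I'm referring to (a simplified sketch,
not the exact patch code):

static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
                                               int vector)
{
        struct vcpu_vmx *vmx = to_vmx(vcpu);

        if (!is_guest_mode(vcpu) || vector != vmx->nested.posted_intr_nv)
                return -1;

        /* Remember that L1 really sent a posted_intr_nv IPI: this is the
         * only signal left if the target is not running in guest mode
         * and we merely kick it below. */
        vmx->nested.pi_pending = true;

        if (vcpu->mode == IN_GUEST_MODE) {
                /* Target is executing L2 right now: a physical IPI on the
                 * nested PI vector makes the CPU process the descriptor
                 * directly. */
                apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
                                    POSTED_INTR_NESTED_VECTOR);
        } else {
                /* OUTSIDE_GUEST_MODE: no physical IPI; just kick so the
                 * vCPU completes the posted interrupt before VMEntry. */
                kvm_make_request(KVM_REQ_EVENT, vcpu);
                kvm_vcpu_kick(vcpu);
        }
        return 0;
}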
- likewise, vmx_cpu_has_nested_posted_interrupt should only check
vmx->nested.pi_pending. I don't think it should even check
is_guest_mode(vcpu) or vmx->nested.pi_desc; compare with the handling of
KVM_REQ_NMI, which doesn't test nmi_allowed. This leads me to the next point.
There is an interesting question about what should happen when
(is_guest_mode(vcpu) && vmx->nested.pi_pending &&
!pi_test_on(vmx->nested.pi_desc)). I would expect the L2 guest not to be
woken from its HLT, as virtual interrupt delivery should not see any
pending interrupt to inject. But I would expect the same when the PIR is
empty (even if the ON bit is set), which this condition doesn't check
either...
So I think I agree, but we may have a small semantic issue here of waking
the L2 vCPU unnecessarily. This should be a very rare case, though, so
maybe handle it later (write a TODO comment).
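In code, what I think we are agreeing on would be roughly this
(illustrative sketch only):

static bool vmx_cpu_has_nested_posted_interrupt(struct kvm_vcpu *vcpu)
{
        struct vcpu_vmx *vmx = to_vmx(vcpu);

        /*
         * TODO: this may wake a halted L2 vCPU even when the PIR is
         * empty (or ON is clear), in which case virtual interrupt
         * delivery will find nothing to inject.  Rare enough to handle
         * later.
         */
        return vmx->nested.pi_pending;
}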
- vmx_complete_nested_posted_interrupt *must* find a valid descriptor,
even after a vmexit, because vmx_complete_nested_posted_interrupt is the
very first thing that runs and L1 has had no time to run a vmwrite.
I agree. It is better to WARN in case the descriptor is not valid.
So the above could become:
if (is_guest_mode(vcpu) && vmx->nested.pi_pending) {
        vmx->nested.pi_pending = false;
        apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
                            POSTED_INTR_NESTED_VECTOR);
}
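With the WARN suggested above, the completion path would then start with
something along these lines (illustrative only):

        /* The vmcs12 PI descriptor must still be mapped here, since
         * this runs before L1 has had any chance to vmwrite. */
        if (WARN_ON_ONCE(!vmx->nested.pi_desc))
                return;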
and then what about that is_guest_mode check? When L1 sent the PI
notification, it thought that the destination VCPU was in guest mode.
Upon vmexit, L1 can expect to either see ON=0 or get the notification
interrupt itself. So either we deliver the interrupt to L1, or we must
do L2 virtual interrupt delivery before L1 even runs. Hence IMO the
is_guest_mode check is wrong, which unfortunately throws a wrench into
the self-IPI idea.
I agree with moving vmx_complete_nested_posted_interrupt out of
check_nested_events, and also with the simplification of
vmx_hwapic_irr_update. However, in my opinion the self-IPI is actually
complicating things. Once the handling of vmx->nested.pi_pending is
cleaned up (the nested PI interrupt handler in patch 9 is a great
improvement in that respect), we can move
vmx_complete_nested_posted_interrupt to a new KVM_REQ_NESTED_PI request
that replaces vmx->nested.pi_pending. The GUEST_INTR_STATUS write can
be done in the vmcs12, and will then be copied by prepare_vmcs02 to the
vmcs02.
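As a rough illustration of that direction (hypothetical sketch;
KVM_REQ_NESTED_PI does not exist yet and the exact plumbing would
certainly differ):

        /* Delivery side: instead of setting vmx->nested.pi_pending,
         * raise a request and kick the target vCPU. */
        kvm_make_request(KVM_REQ_NESTED_PI, vcpu);
        kvm_vcpu_kick(vcpu);

        /* In vcpu_enter_guest(): consume the request right before
         * VMEntry. */
        if (kvm_check_request(KVM_REQ_NESTED_PI, vcpu))
                vmx_complete_nested_posted_interrupt(vcpu);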
I may be wrong, but based on my experience cleaning up the bare-metal
APICv, I suspect that both the current code and this version are overly
complicated, and there is a nice and simple core waiting to be
extracted. Let's fix the race condition in the simple way, and
separately focus on cleaning up pi_pending and everything that follows
from there.
Interesting.
I think I now follow what you mean regarding cleaning up the logic around
pi_pending. This is how I understand it:
1. "vmx->nested.pi_pending" is a flag used to indicate "L1 has sent a
vmcs12->posted_intr_nv IPI". That's it.
2. The current code is a bit weird in the sense that instead of signaling
the pending IPI in the virtual LAPIC IRR, we set it in a special variable.
If we had set it in the virtual LAPIC IRR, we could in theory behave very
similarly to a standard CPU. At the interrupt injection point, we could
(see the rough sketch at the end of this mail):
(a) If the vCPU is in root-mode: just inject the pending interrupt normally.
(b) If the vCPU is in non-root-mode and the posted-interrupts feature is
active, then instead of injecting the pending interrupt, simulate
processing of posted-interrupts.
3. The processing of the nested posted-interrupts itself can still be
done via the self-IPI mechanism.
4. Because (2) is not done, there is currently still an issue that L1
doesn't receive a vmcs12->posted_intr_nv interrupt when the target vCPU
thread has exited from L2 to L1 with pi_pending=true.
Do we agree on the above? Or am I still misunderstanding something?
If we agree, I will attempt to clean up the code to handle it as described above.
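For concreteness, the injection-point split in (2) could look roughly
like this (purely illustrative; nested_pi_active() is a made-up helper
name):

        if (!is_guest_mode(vcpu)) {
                /* (a) Root mode: the posted_intr_nv vector sits in the
                 * virtual LAPIC IRR and is injected into L1 normally,
                 * like any other pending interrupt. */
                return;
        }

        if (nested_pi_active(vcpu)) {
                /* (b) Non-root mode with vmcs12 posted interrupts
                 * active: don't inject; simulate posted-interrupt
                 * processing instead, e.g. via the nested-PI self-IPI. */
                apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
                                    POSTED_INTR_NESTED_VECTOR);
        }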
Regards,
-Liran