On 27/12/2017 11:51, Liran Alon wrote: > Maybe I misunderstand you, but I don't think this is true. > complete_nested_posted_interrupt() relies on being called at a very > specific call-site: Right before VMEntry and after interrupts are > disabled. It works by issue a self-IPI to cause CPU to actually process > the nested posted-interrupts. > > In addition, I don't see how we can utilize the fact that > kvm_cpu_has_interrupt() calls apic_has_interrupt_for_ppr() as > vmx.nested.pi_desc->pir should never be synced into L0 vLAPIC IRR. > In contrast to vmx.pi_desc->pir which does after sync_pir_to_irr(). You're right, the callbacks must stay. What confused me is the reference to pi_test_on in vmx_cpu_has_nested_posted_interrupt. While your patches are an improvement, there is still some confusion on what leads to injecting a nested posted interrupt, and on the split between "deliver" and "complete": - checking pi_test_on should only be done when consuming the L2 PI descriptor (which in your case is done by the processor via the self-IPI). So you don't need to test it before sending the self-IPI. - vmx->nested.pi_pending should not be set in vmx_deliver_nested_posted_interrupt after patch 9. pi_pending should only be set by the nested PI handler, when it is invoked outside guest mode. It is then consumed before doing the self-IPI as in the above sketch. - likewise, vmx_cpu_has_nested_posted_interrupt should only check vmx->nested.pi_pending. I don't think it should even check is_guest_mode(vcpu) or vmx->nested.pi_desc; compare with the handling of KVM_REQ_NMI, it doesn't test nmi_allowed. This leads me to the next point. - vmx_complete_nested_posted_interrupt *must* find a valid descriptor. Even after a vmexit, because vmx_complete_nested_posted_interrupt is the very first thing that runs and L1 has had no time to run a vmwrite. So the above could become: if (is_guest_mode(vcpu) && vmx->nested.pi_pending) { vmx->nested.pi_pending = false; apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), POSTED_INTR_NESTED_VECTOR); } and then what about that is_guest_mode check? When L1 sent the PI notification, it thought that the destination VCPU was in guest mode. Upon vmexit, L1 can expect to either see ON=0 or get the notification interrupt itself. So either we deliver the interrupt to L1, or we must do L2 virtual interrupt delivery before L1 even runs. Hence IMO the is_guest_mode is wrong, which unfortunately throws a wrench somewhat in the self-IPI idea quite a bit. I agree with moving vmx_complete_nested_posted_interrupt out of check_nested_events, and also with the simplification of vmx_hwapic_irr_update. However, in my opinion the self-IPI is actually complicating things. Once the handling of vmx->nested.pi_pending is cleaned up (the nested PI interrupt handler in patch 9 is a great improvement in that respect), we can move vmx_complete_nested_posted_interrupt to a new KVM_REQ_NESTED_PI request that replaces vmx->nested.pi_pending. The GUEST_INTR_STATUS write can be done in the vmcs12, and will then be copied by prepare_vmcs02 to the vmcs02. I may be wrong, but based on my experience cleaning up the bare-metal APICv, I suspect that both the current code and this version are overly complicated, and there is a nice and simple core waiting to be extracted. Let's fix the race condition in the simple way, and separately focus on cleaning up pi_pending and everything that follows from there. Paolo