Re: [PATCH v2 4/5] KVM: nVMX: Deliver missed nested-PI notification-vector via self-IPI while interrupts disabled

Radim Krčmář <rkrcmar@xxxxxxxxxx> · Wed, 6 Dec 2017 21:45:52 +0100

2017-12-05 10:16+0200, Liran Alon:
> When L1 wants to send a posted-interrupt to another L1 CPU running L2,
> it sets the relevant bit in vmx->nested.pi_desc->pir and ON bit in
> vmx->nested.pi_desc->control. Then it attempts to send a
> notification-vector IPI to dest L1 CPU.
> This attempt to send IPI will exit to L0 which will reach
> vmx_deliver_nested_posted_interrupt() which does the following:
> 1. If dest L0 CPU is currently running guest (vcpu->mode ==
> IN_GUEST_MODE), it sends a physical IPI of PI nested-notification-vector.
> 2. It sets KVM_REQ_EVENT in dest vCPU. This is done such that if dest
> L0 CPU exits from guest to host and therefore doesn't recognize physical
> IPI (or it wasn't sent), then KVM_REQ_EVENT will be consumed on next
> vmentry which will call vmx_check_nested_events() which should call
> (in theory) vmx_complete_nested_posted_interrupt(). That function
> should see vmx->nested.pi_desc->control ON bit is set and therefore
> "emulate" posted-interrupt delivery for L1 (sync PIR to IRR in L1
> virtual-apic-page & update vmcs02 RVI).
> 
> The above logic regarding nested-posted-interrupts contains multiple
> issues:
> 
> A) Race-condition:
> On step (1) it is possible sender will see vcpu->mode == IN_GUEST_MODE
> but before sending physical IPI, the dest CPU will exit to host.
> Therefore, physical IPI could be received at host which it's hanlder
> does nothing. In addition, assume that dest CPU passes the checks
> for pending kvm requests before sender sets KVM_REQ_EVENT. Therefore,
> dest CPU will resume L2 without evaluating nested-posted-interrupts.
> 
> B) vmx_complete_nested_posted_interrupt() is not always called
> when needed. Even if dest CPU consumed KVM_REQ_EVENT, there is a bug
> that vmx_check_nested_events() could exit from L2 to L1 before calling
> vmx_complete_nested_posted_interrupt(). Therefore, on next resume of
> L1 into L2, nested-posted-interrupts won't be evaluated even though L0
> resume L2 (We may resume L2 without having KVM_REQ_EVENT set).
> 
> This commit removes nested-posted-interrupts processing from
> check_nested_events() and instead makes sure to process
> nested-posted-interrupts on vmentry after interrupts
> disabled. Processing of nested-posted-interrupts is delegated
> to hardware by issueing a self-IPI of relevant notification-vector
> which will be delivered to CPU when CPU is in guest.
> 
> * Bug (A) is solved by the fact that processing of
> nested-posted-interrupts is not dependent on KVM_REQ_EVENT and
> happens before every vmentry to L2.
> * Bug (B) is now trivially solved by processing
> nested-posted-interrupts before each vmentry to L2 guest.
> 
> An alternative could have been to just call
> vmx_complete_nested_posted_interrupt() at this call-site. However, we
> have decided to go with this approach because:
> 1. It would require modifying vmx_complete_nested_posted_interrupt()
> to be able to work with interrupts disabled (not-trivial).
> 2. We preffer to avoid software-emulation of hardware behavior if it
> is possible.

Nice solution!  Self-IPI was slower on non-nested, but even if it is
slower here, the code saving is definitely worth it.

> Fixes: 705699a13994 ("KVM: nVMX: Enable nested posted interrupt processing")
> 
> Signed-off-by: Liran Alon <liran.alon@xxxxxxxxxx>
> Co-authored-by: Nikita Leshenko <nikita.leshchenko@xxxxxxxxxx>
> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@xxxxxxxxxx>
> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@xxxxxxxxxx>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> ---
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> @@ -5156,6 +5114,17 @@ static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
>  		kvm_vcpu_kick(vcpu);
>  }
>  
> +static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	WARN_ON(!vcpu->arch.apicv_active);
> +
> +	if (is_guest_mode(vcpu) && vmx->nested.pi_desc &&
> +	    pi_test_on(vmx->nested.pi_desc))
> +		kvm_vcpu_trigger_posted_interrupt(vcpu, true);

We're always sending to self, so the function would be better with
apic->send_IPI_self() as it might be accelerated by hardware.

> +}
> +
>  /*
>   * Set up the vmcs's constant host-state fields, i.e., host-state fields that
>   * will not change in the lifetime of the guest.
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> @@ -6962,12 +6962,18 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  	smp_mb__after_srcu_read_unlock();
>  
>  	/*
> -	 * This handles the case where a posted interrupt was
> -	 * notified with kvm_vcpu_kick.
> +	 * In case guest got the posted-interrupt notification
> +	 * vector while running in host, we need to make sure
> +	 * it arrives to guest.
> +	 * For L1 posted-interrupts, we manually sync PIR to IRR.
> +	 * For L2 posted-interrupts, we send notification-vector
> +	 * again by self IPI such that it will now be received in guest.
>  	 */
> -	if (kvm_lapic_enabled(vcpu)) {
> -		if (kvm_x86_ops->sync_pir_to_irr && vcpu->arch.apicv_active)
> +	if (kvm_lapic_enabled(vcpu) && vcpu->arch.apicv_active) {
> +		if (kvm_x86_ops->sync_pir_to_irr)
>  			kvm_x86_ops->sync_pir_to_irr(vcpu);
> +		if (kvm_x86_ops->complete_nested_posted_interrupt)
> +			kvm_x86_ops->complete_nested_posted_interrupt(vcpu);

I think that vcpu->arch.apicv_active is enough to guarantee presence of
both sync_pir_to_irr and complete_nested_posted_interrupt,

and considering this is a hot path, I'd check is_guest_mode(vcpu) before
calling the second function.

(Saving the function call with a new x86_op for those two seems viable
 too, but would need deeper analysis and benchmarking as it is putting
 more code into hot cache ...)

Thanks.