Re: [PATCH v2 4/5] KVM: nVMX: Deliver missed nested-PI notification-vector via self-IPI while interrupts disabled

Wincy Van <fanwenyi0529@xxxxxxxxx> · Tue, 5 Dec 2017 16:36:44 +0800

Good idea !
Using self-ipi is much more simple and efficient than emulation.

On Tue, Dec 5, 2017 at 4:16 PM, Liran Alon <liran.alon@xxxxxxxxxx> wrote:
> When L1 wants to send a posted-interrupt to another L1 CPU running L2,
> it sets the relevant bit in vmx->nested.pi_desc->pir and ON bit in
> vmx->nested.pi_desc->control. Then it attempts to send a
> notification-vector IPI to dest L1 CPU.
> This attempt to send IPI will exit to L0 which will reach
> vmx_deliver_nested_posted_interrupt() which does the following:
> 1. If dest L0 CPU is currently running guest (vcpu->mode ==
> IN_GUEST_MODE), it sends a physical IPI of PI nested-notification-vector.
> 2. It sets KVM_REQ_EVENT in dest vCPU. This is done such that if dest
> L0 CPU exits from guest to host and therefore doesn't recognize physical
> IPI (or it wasn't sent), then KVM_REQ_EVENT will be consumed on next
> vmentry which will call vmx_check_nested_events() which should call
> (in theory) vmx_complete_nested_posted_interrupt(). That function
> should see vmx->nested.pi_desc->control ON bit is set and therefore
> "emulate" posted-interrupt delivery for L1 (sync PIR to IRR in L1
> virtual-apic-page & update vmcs02 RVI).
>
> The above logic regarding nested-posted-interrupts contains multiple
> issues:
>
> A) Race-condition:
> On step (1) it is possible sender will see vcpu->mode == IN_GUEST_MODE
> but before sending physical IPI, the dest CPU will exit to host.
> Therefore, physical IPI could be received at host which it's hanlder
> does nothing. In addition, assume that dest CPU passes the checks
> for pending kvm requests before sender sets KVM_REQ_EVENT. Therefore,
> dest CPU will resume L2 without evaluating nested-posted-interrupts.
>
> B) vmx_complete_nested_posted_interrupt() is not always called
> when needed. Even if dest CPU consumed KVM_REQ_EVENT, there is a bug
> that vmx_check_nested_events() could exit from L2 to L1 before calling
> vmx_complete_nested_posted_interrupt(). Therefore, on next resume of
> L1 into L2, nested-posted-interrupts won't be evaluated even though L0
> resume L2 (We may resume L2 without having KVM_REQ_EVENT set).
>
> This commit removes nested-posted-interrupts processing from
> check_nested_events() and instead makes sure to process
> nested-posted-interrupts on vmentry after interrupts
> disabled. Processing of nested-posted-interrupts is delegated
> to hardware by issueing a self-IPI of relevant notification-vector
> which will be delivered to CPU when CPU is in guest.
>
> * Bug (A) is solved by the fact that processing of
> nested-posted-interrupts is not dependent on KVM_REQ_EVENT and
> happens before every vmentry to L2.
> * Bug (B) is now trivially solved by processing
> nested-posted-interrupts before each vmentry to L2 guest.
>
> An alternative could have been to just call
> vmx_complete_nested_posted_interrupt() at this call-site. However, we
> have decided to go with this approach because:
> 1. It would require modifying vmx_complete_nested_posted_interrupt()
> to be able to work with interrupts disabled (not-trivial).
> 2. We preffer to avoid software-emulation of hardware behavior if it
> is possible.
>
> Fixes: 705699a13994 ("KVM: nVMX: Enable nested posted interrupt processing")
>
> Signed-off-by: Liran Alon <liran.alon@xxxxxxxxxx>
> Co-authored-by: Nikita Leshenko <nikita.leshchenko@xxxxxxxxxx>
> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@xxxxxxxxxx>
> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@xxxxxxxxxx>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/vmx.c              | 57 ++++++++++-------------------------------
>  arch/x86/kvm/x86.c              | 14 +++++++---
>  3 files changed, 25 insertions(+), 47 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 1bfb99770c34..dc0affe69903 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -980,6 +980,7 @@ struct kvm_x86_ops {
>         void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set);
>         void (*set_apic_access_page_addr)(struct kvm_vcpu *vcpu, hpa_t hpa);
>         void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
> +       void (*complete_nested_posted_interrupt)(struct kvm_vcpu *vcpu);
>         int (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
>         int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
>         int (*get_tdp_level)(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index bbc023fb9ef1..517822f94158 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -534,12 +534,6 @@ static bool pi_test_and_set_on(struct pi_desc *pi_desc)
>                         (unsigned long *)&pi_desc->control);
>  }
>
> -static bool pi_test_and_clear_on(struct pi_desc *pi_desc)
> -{
> -       return test_and_clear_bit(POSTED_INTR_ON,
> -                       (unsigned long *)&pi_desc->control);
> -}
> -
>  static int pi_test_and_set_pir(int vector, struct pi_desc *pi_desc)
>  {
>         return test_and_set_bit(vector, (unsigned long *)pi_desc->pir);
> @@ -5041,37 +5035,6 @@ static void nested_mark_vmcs12_pages_dirty(struct kvm_vcpu *vcpu)
>         }
>  }
>
> -
> -static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
> -{
> -       struct vcpu_vmx *vmx = to_vmx(vcpu);
> -       int max_irr;
> -       void *vapic_page;
> -       u16 status;
> -
> -       if (!vmx->nested.pi_desc)
> -               return;
> -
> -       if (!pi_test_and_clear_on(vmx->nested.pi_desc))
> -               return;
> -
> -       max_irr = find_last_bit((unsigned long *)vmx->nested.pi_desc->pir, 256);
> -       if (max_irr != 256) {
> -               vapic_page = kmap(vmx->nested.virtual_apic_page);
> -               __kvm_apic_update_irr(vmx->nested.pi_desc->pir, vapic_page);
> -               kunmap(vmx->nested.virtual_apic_page);
> -
> -               status = vmcs_read16(GUEST_INTR_STATUS);
> -               if ((u8)max_irr > ((u8)status & 0xff)) {
> -                       status &= ~0xff;
> -                       status |= (u8)max_irr;
> -                       vmcs_write16(GUEST_INTR_STATUS, status);
> -               }
> -       }
> -
> -       nested_mark_vmcs12_pages_dirty(vcpu);
> -}
> -
>  static inline bool kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
>                                                      bool nested)
>  {
> @@ -5120,11 +5083,6 @@ static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
>             vector == vmx->nested.posted_intr_nv) {
>                 /* the PIR and ON have been set by L1. */
>                 kvm_vcpu_trigger_posted_interrupt(vcpu, true);
> -               /*
> -                * If a posted intr is not recognized by hardware,
> -                * we will accomplish it in the next vmentry.
> -                */
> -               kvm_make_request(KVM_REQ_EVENT, vcpu);
>                 return 0;
>         }
>         return -1;
> @@ -5156,6 +5114,17 @@ static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
>                 kvm_vcpu_kick(vcpu);
>  }
>
> +static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
> +{
> +       struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +       WARN_ON(!vcpu->arch.apicv_active);
> +
> +       if (is_guest_mode(vcpu) && vmx->nested.pi_desc &&
> +           pi_test_on(vmx->nested.pi_desc))
> +               kvm_vcpu_trigger_posted_interrupt(vcpu, true);
> +}
> +
>  /*
>   * Set up the vmcs's constant host-state fields, i.e., host-state fields that
>   * will not change in the lifetime of the guest.
> @@ -6823,6 +6792,7 @@ static __init int hardware_setup(void)
>         if (!cpu_has_vmx_apicv()) {
>                 enable_apicv = 0;
>                 kvm_x86_ops->sync_pir_to_irr = NULL;
> +               kvm_x86_ops->complete_nested_posted_interrupt = NULL;
>         }
>
>         if (cpu_has_vmx_tsc_scaling()) {
> @@ -11144,7 +11114,6 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
>                 return 0;
>         }
>
> -       vmx_complete_nested_posted_interrupt(vcpu);
>         return 0;
>  }
>
> @@ -12136,6 +12105,8 @@ static int enable_smi_window(struct kvm_vcpu *vcpu)
>         .hwapic_isr_update = vmx_hwapic_isr_update,
>         .sync_pir_to_irr = vmx_sync_pir_to_irr,
>         .deliver_posted_interrupt = vmx_deliver_posted_interrupt,
> +       .complete_nested_posted_interrupt =
> +               vmx_complete_nested_posted_interrupt,
>
>         .set_tss_addr = vmx_set_tss_addr,
>         .get_tdp_level = get_ept_level,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 34c85aa2e2d1..6fb58ab9a85c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6962,12 +6962,18 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>         smp_mb__after_srcu_read_unlock();
>
>         /*
> -        * This handles the case where a posted interrupt was
> -        * notified with kvm_vcpu_kick.
> +        * In case guest got the posted-interrupt notification
> +        * vector while running in host, we need to make sure
> +        * it arrives to guest.
> +        * For L1 posted-interrupts, we manually sync PIR to IRR.
> +        * For L2 posted-interrupts, we send notification-vector
> +        * again by self IPI such that it will now be received in guest.
>          */
> -       if (kvm_lapic_enabled(vcpu)) {
> -               if (kvm_x86_ops->sync_pir_to_irr && vcpu->arch.apicv_active)
> +       if (kvm_lapic_enabled(vcpu) && vcpu->arch.apicv_active) {
> +               if (kvm_x86_ops->sync_pir_to_irr)
>                         kvm_x86_ops->sync_pir_to_irr(vcpu);
> +               if (kvm_x86_ops->complete_nested_posted_interrupt)
> +                       kvm_x86_ops->complete_nested_posted_interrupt(vcpu);
>         }
>
>         if (vcpu->mode == EXITING_GUEST_MODE || kvm_request_pending(vcpu)
> --
> 1.9.1
>