Good idea ! Using self-ipi is much more simple and efficient than emulation. On Tue, Dec 5, 2017 at 4:16 PM, Liran Alon <liran.alon@xxxxxxxxxx> wrote: > When L1 wants to send a posted-interrupt to another L1 CPU running L2, > it sets the relevant bit in vmx->nested.pi_desc->pir and ON bit in > vmx->nested.pi_desc->control. Then it attempts to send a > notification-vector IPI to dest L1 CPU. > This attempt to send IPI will exit to L0 which will reach > vmx_deliver_nested_posted_interrupt() which does the following: > 1. If dest L0 CPU is currently running guest (vcpu->mode == > IN_GUEST_MODE), it sends a physical IPI of PI nested-notification-vector. > 2. It sets KVM_REQ_EVENT in dest vCPU. This is done such that if dest > L0 CPU exits from guest to host and therefore doesn't recognize physical > IPI (or it wasn't sent), then KVM_REQ_EVENT will be consumed on next > vmentry which will call vmx_check_nested_events() which should call > (in theory) vmx_complete_nested_posted_interrupt(). That function > should see vmx->nested.pi_desc->control ON bit is set and therefore > "emulate" posted-interrupt delivery for L1 (sync PIR to IRR in L1 > virtual-apic-page & update vmcs02 RVI). > > The above logic regarding nested-posted-interrupts contains multiple > issues: > > A) Race-condition: > On step (1) it is possible sender will see vcpu->mode == IN_GUEST_MODE > but before sending physical IPI, the dest CPU will exit to host. > Therefore, physical IPI could be received at host which it's hanlder > does nothing. In addition, assume that dest CPU passes the checks > for pending kvm requests before sender sets KVM_REQ_EVENT. Therefore, > dest CPU will resume L2 without evaluating nested-posted-interrupts. > > B) vmx_complete_nested_posted_interrupt() is not always called > when needed. Even if dest CPU consumed KVM_REQ_EVENT, there is a bug > that vmx_check_nested_events() could exit from L2 to L1 before calling > vmx_complete_nested_posted_interrupt(). Therefore, on next resume of > L1 into L2, nested-posted-interrupts won't be evaluated even though L0 > resume L2 (We may resume L2 without having KVM_REQ_EVENT set). > > This commit removes nested-posted-interrupts processing from > check_nested_events() and instead makes sure to process > nested-posted-interrupts on vmentry after interrupts > disabled. Processing of nested-posted-interrupts is delegated > to hardware by issueing a self-IPI of relevant notification-vector > which will be delivered to CPU when CPU is in guest. > > * Bug (A) is solved by the fact that processing of > nested-posted-interrupts is not dependent on KVM_REQ_EVENT and > happens before every vmentry to L2. > * Bug (B) is now trivially solved by processing > nested-posted-interrupts before each vmentry to L2 guest. > > An alternative could have been to just call > vmx_complete_nested_posted_interrupt() at this call-site. However, we > have decided to go with this approach because: > 1. It would require modifying vmx_complete_nested_posted_interrupt() > to be able to work with interrupts disabled (not-trivial). > 2. We preffer to avoid software-emulation of hardware behavior if it > is possible. > > Fixes: 705699a13994 ("KVM: nVMX: Enable nested posted interrupt processing") > > Signed-off-by: Liran Alon <liran.alon@xxxxxxxxxx> > Co-authored-by: Nikita Leshenko <nikita.leshchenko@xxxxxxxxxx> > Reviewed-by: Krish Sadhukhan <krish.sadhukhan@xxxxxxxxxx> > Signed-off-by: Krish Sadhukhan <krish.sadhukhan@xxxxxxxxxx> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> > --- > arch/x86/include/asm/kvm_host.h | 1 + > arch/x86/kvm/vmx.c | 57 ++++++++++------------------------------- > arch/x86/kvm/x86.c | 14 +++++++--- > 3 files changed, 25 insertions(+), 47 deletions(-) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index 1bfb99770c34..dc0affe69903 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -980,6 +980,7 @@ struct kvm_x86_ops { > void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set); > void (*set_apic_access_page_addr)(struct kvm_vcpu *vcpu, hpa_t hpa); > void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector); > + void (*complete_nested_posted_interrupt)(struct kvm_vcpu *vcpu); > int (*sync_pir_to_irr)(struct kvm_vcpu *vcpu); > int (*set_tss_addr)(struct kvm *kvm, unsigned int addr); > int (*get_tdp_level)(struct kvm_vcpu *vcpu); > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index bbc023fb9ef1..517822f94158 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -534,12 +534,6 @@ static bool pi_test_and_set_on(struct pi_desc *pi_desc) > (unsigned long *)&pi_desc->control); > } > > -static bool pi_test_and_clear_on(struct pi_desc *pi_desc) > -{ > - return test_and_clear_bit(POSTED_INTR_ON, > - (unsigned long *)&pi_desc->control); > -} > - > static int pi_test_and_set_pir(int vector, struct pi_desc *pi_desc) > { > return test_and_set_bit(vector, (unsigned long *)pi_desc->pir); > @@ -5041,37 +5035,6 @@ static void nested_mark_vmcs12_pages_dirty(struct kvm_vcpu *vcpu) > } > } > > - > -static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu) > -{ > - struct vcpu_vmx *vmx = to_vmx(vcpu); > - int max_irr; > - void *vapic_page; > - u16 status; > - > - if (!vmx->nested.pi_desc) > - return; > - > - if (!pi_test_and_clear_on(vmx->nested.pi_desc)) > - return; > - > - max_irr = find_last_bit((unsigned long *)vmx->nested.pi_desc->pir, 256); > - if (max_irr != 256) { > - vapic_page = kmap(vmx->nested.virtual_apic_page); > - __kvm_apic_update_irr(vmx->nested.pi_desc->pir, vapic_page); > - kunmap(vmx->nested.virtual_apic_page); > - > - status = vmcs_read16(GUEST_INTR_STATUS); > - if ((u8)max_irr > ((u8)status & 0xff)) { > - status &= ~0xff; > - status |= (u8)max_irr; > - vmcs_write16(GUEST_INTR_STATUS, status); > - } > - } > - > - nested_mark_vmcs12_pages_dirty(vcpu); > -} > - > static inline bool kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu, > bool nested) > { > @@ -5120,11 +5083,6 @@ static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu, > vector == vmx->nested.posted_intr_nv) { > /* the PIR and ON have been set by L1. */ > kvm_vcpu_trigger_posted_interrupt(vcpu, true); > - /* > - * If a posted intr is not recognized by hardware, > - * we will accomplish it in the next vmentry. > - */ > - kvm_make_request(KVM_REQ_EVENT, vcpu); > return 0; > } > return -1; > @@ -5156,6 +5114,17 @@ static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector) > kvm_vcpu_kick(vcpu); > } > > +static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu) > +{ > + struct vcpu_vmx *vmx = to_vmx(vcpu); > + > + WARN_ON(!vcpu->arch.apicv_active); > + > + if (is_guest_mode(vcpu) && vmx->nested.pi_desc && > + pi_test_on(vmx->nested.pi_desc)) > + kvm_vcpu_trigger_posted_interrupt(vcpu, true); > +} > + > /* > * Set up the vmcs's constant host-state fields, i.e., host-state fields that > * will not change in the lifetime of the guest. > @@ -6823,6 +6792,7 @@ static __init int hardware_setup(void) > if (!cpu_has_vmx_apicv()) { > enable_apicv = 0; > kvm_x86_ops->sync_pir_to_irr = NULL; > + kvm_x86_ops->complete_nested_posted_interrupt = NULL; > } > > if (cpu_has_vmx_tsc_scaling()) { > @@ -11144,7 +11114,6 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr) > return 0; > } > > - vmx_complete_nested_posted_interrupt(vcpu); > return 0; > } > > @@ -12136,6 +12105,8 @@ static int enable_smi_window(struct kvm_vcpu *vcpu) > .hwapic_isr_update = vmx_hwapic_isr_update, > .sync_pir_to_irr = vmx_sync_pir_to_irr, > .deliver_posted_interrupt = vmx_deliver_posted_interrupt, > + .complete_nested_posted_interrupt = > + vmx_complete_nested_posted_interrupt, > > .set_tss_addr = vmx_set_tss_addr, > .get_tdp_level = get_ept_level, > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 34c85aa2e2d1..6fb58ab9a85c 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -6962,12 +6962,18 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) > smp_mb__after_srcu_read_unlock(); > > /* > - * This handles the case where a posted interrupt was > - * notified with kvm_vcpu_kick. > + * In case guest got the posted-interrupt notification > + * vector while running in host, we need to make sure > + * it arrives to guest. > + * For L1 posted-interrupts, we manually sync PIR to IRR. > + * For L2 posted-interrupts, we send notification-vector > + * again by self IPI such that it will now be received in guest. > */ > - if (kvm_lapic_enabled(vcpu)) { > - if (kvm_x86_ops->sync_pir_to_irr && vcpu->arch.apicv_active) > + if (kvm_lapic_enabled(vcpu) && vcpu->arch.apicv_active) { > + if (kvm_x86_ops->sync_pir_to_irr) > kvm_x86_ops->sync_pir_to_irr(vcpu); > + if (kvm_x86_ops->complete_nested_posted_interrupt) > + kvm_x86_ops->complete_nested_posted_interrupt(vcpu); > } > > if (vcpu->mode == EXITING_GUEST_MODE || kvm_request_pending(vcpu) > -- > 1.9.1 >