> From: Chao Gao <chao.gao@xxxxxxxxx> > Sent: Friday, June 17, 2022 7:47 PM > > PIN (Posted interrupt notification) is useless to host as KVM always syncs > pending guest interrupts in PID to guest's vAPIC before each VM entry. In > fact, Sending PINs to a CPU running in host will lead to additional > overhead due to interrupt handling. > > Currently, software path, vmx_deliver_posted_interrupt(), is optimized to > issue PINs only if target vCPU is in IN_GUEST_MODE. But hardware paths > (VT-d and Intel IPI virtualization) aren't optimized. > > Set PID.SN right after VM exits and clear it before VM entry to minimize > the chance of hardware issuing PINs to a CPU when it's in host. > > Also honour PID.SN bit in vmx_deliver_posted_interrupt(). > > When IPI virtualization is enabled, this patch increases "perf bench" [*] > by 4% from 8.12 us/ops to 7.80 us/ops. > > [*] test cmd: perf bench sched pipe -T. Note that we change the source > code to pin two threads to two different vCPUs so that it can reproduce > stable results. > > Signed-off-by: Chao Gao <chao.gao@xxxxxxxxx> > --- > arch/x86/kvm/vmx/posted_intr.c | 28 ++-------------------------- > arch/x86/kvm/vmx/vmx.c | 24 +++++++++++++++++++++++- > 2 files changed, 25 insertions(+), 27 deletions(-) > > diff --git a/arch/x86/kvm/vmx/posted_intr.c > b/arch/x86/kvm/vmx/posted_intr.c > index 237a1f40f939..a0458f72df99 100644 > --- a/arch/x86/kvm/vmx/posted_intr.c > +++ b/arch/x86/kvm/vmx/posted_intr.c > @@ -70,12 +70,6 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int > cpu) > * needs to be changed. > */ > if (pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR && vcpu->cpu == > cpu) { > - /* > - * Clear SN if it was set due to being preempted. Again, do > - * this even if there is no assigned device for simplicity. > - */ > - if (pi_test_and_clear_sn(pi_desc)) > - goto after_clear_sn; > return; > } > > @@ -99,12 +93,8 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int > cpu) > do { > old.control = new.control = READ_ONCE(pi_desc->control); > > - /* > - * Clear SN (as above) and refresh the destination APIC ID to > - * handle task migration (@cpu != vcpu->cpu). > - */ > new.ndst = dest; > - new.sn = 0; > + new.sn = 1; A comment is appreciated. > > /* > * Restore the notification vector; in the blocking case, the > @@ -114,19 +104,6 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int > cpu) > } while (pi_try_set_control(pi_desc, old.control, new.control)); > > local_irq_restore(flags); > - > -after_clear_sn: > - > - /* > - * Clear SN before reading the bitmap. The VT-d firmware > - * writes the bitmap and reads SN atomically (5.2.3 in the > - * spec), so it doesn't really have a memory barrier that > - * pairs with this, but we cannot do that and we need one. > - */ > - smp_mb__after_atomic(); > - > - if (!pi_is_pir_empty(pi_desc)) > - pi_set_on(pi_desc); > } > > static bool vmx_can_use_vtd_pi(struct kvm *kvm) > @@ -154,13 +131,12 @@ static void pi_enable_wakeup_handler(struct > kvm_vcpu *vcpu) > &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu)); > raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu- > >cpu)); > > - WARN(pi_desc->sn, "PI descriptor SN field set before blocking"); > - > do { > old.control = new.control = READ_ONCE(pi_desc->control); > > /* set 'NV' to 'wakeup vector' */ > new.nv = POSTED_INTR_WAKEUP_VECTOR; > + new.sn = 0; > } while (pi_try_set_control(pi_desc, old.control, new.control)); > there is a problem a few lines downwards: /* * Send a wakeup IPI to this CPU if an interrupt may have been posted * before the notification vector was updated, in which case the IRQ * will arrive on the non-wakeup vector. An IPI is needed as calling * try_to_wake_up() from ->sched_out() isn't allowed (IRQs are not * enabled until it is safe to call try_to_wake_up() on the task being * scheduled out). */ if (pi_test_on(&new)) apic->send_IPI_self(POSTED_INTR_WAKEUP_VECTOR); 'on' is not set when SN is set. This is different from original assumption which has SN cleared in above window. In this case pi_test_on() should be replaced with pi_is_pir_empty(). There is another simplification possible in vmx_vcpu_pi_put(): if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu)) pi_enable_wakeup_handler(vcpu); /* * Set SN when the vCPU is preempted. Note, the vCPU can both be seen * as blocking and preempted, e.g. if it's preempted between setting * its wait state and manually scheduling out. */ if (vcpu->preempted) pi_set_sn(pi_desc); With this patch 'sn' is already set when a runnable vcpu is preempted hence above is required only for a blocking vcpu. And in the blocking case if the notification is anyway suppressed it's pointless to further change the notification vector. Then it could be simplified as: if (!vcpu->preempted && kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu)) pi_enable_wakeup_handler(vcpu); > /* > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > index a3c5504601a8..fa915b1680eb 100644 > --- a/arch/x86/kvm/vmx/vmx.c > +++ b/arch/x86/kvm/vmx/vmx.c > @@ -4036,6 +4036,9 @@ static int vmx_deliver_posted_interrupt(struct > kvm_vcpu *vcpu, int vector) > if (pi_test_and_set_pir(vector, &vmx->pi_desc)) > return 0; > > + if (pi_test_sn(&vmx->pi_desc)) > + return 0; > + > /* If a previous notification has sent the IPI, nothing to do. */ > if (pi_test_and_set_on(&vmx->pi_desc)) > return 0; > @@ -6520,8 +6523,17 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu > *vcpu) > if (KVM_BUG_ON(!enable_apicv, vcpu->kvm)) > return -EIO; > > - if (pi_test_on(&vmx->pi_desc)) { > + if (pi_test_on(&vmx->pi_desc) || pi_test_sn(&vmx->pi_desc)) { this has potential side-effect in vmexit/vmentry path where pi_desc is always scanned now. While reducing interrupts help the IPC test case, do you observe any impact on other scenarios where interrupts are not the main cause of vmexits? > pi_clear_on(&vmx->pi_desc); > + > + /* > + * IN_GUEST_MODE means we are about to enter vCPU. > Allow > + * PIN (posted interrupt notification) to deliver is key > + * to interrupt posting. Clear PID.SN. > + */ > + if (vcpu->mode == IN_GUEST_MODE) > + pi_clear_sn(&vmx->pi_desc); I wonder whether it makes more sense to have 'sn' closely sync-ed with vcpu->mode, e.g. having a kvm_x86_set_vcpu_mode() ops to translate vcpu->mode into vmx/svm specific hardware bits like 'sn' here. Then call it in common place when vcpu->mode is changed. > + > /* > * IOMMU can write to PID.ON, so the barrier matters even > on UP. > * But on x86 this is just a compiler barrier anyway. > @@ -6976,6 +6988,16 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu > *vcpu) > /* The actual VMENTER/EXIT is in the .noinstr.text section. */ > vmx_vcpu_enter_exit(vcpu, vmx); > > + /* > + * Suppress notification right after VM exits to minimize the > + * window where VT-d or remote CPU may send a useless notification > + * when posting interrupts to a VM. Note that the notification is > + * useless because KVM syncs pending interrupts in PID.IRR to vAPIC > + * IRR before VM entry. > + */ > + if (kvm_vcpu_apicv_active(vcpu)) > + pi_set_sn(&vmx->pi_desc); > + > /* > * We do not use IBRS in the kernel. If this vCPU has used the > * SPEC_CTRL MSR it may have left it on; save the value and > -- > 2.25.1