On 09/03/2017 10:40, Wanpeng Li wrote:
> 2017-03-09 9:23 GMT+08:00 Wanpeng Li <kernellwp@xxxxxxxxx>:
>> 2016-12-20 0:17 GMT+08:00 Paolo Bonzini <pbonzini@xxxxxxxxxx>:
>>> Since bf9f6ac8d749 ("KVM: Update Posted-Interrupts Descriptor when vCPU
>>> is blocked", 2015-09-18) the posted interrupt descriptor is checked
>>> unconditionally for PIR.ON. Therefore we don't need KVM_REQ_EVENT to
>>> trigger the scan and, if NMIs or SMIs are not involved, we can avoid
>>> the complicated event injection path.
>>>
>>> Calling kvm_vcpu_kick if PIR.ON=1 is also useless, though it has been
>>> there since APICv was introduced.
>>>
>>> However, without the KVM_REQ_EVENT safety net KVM needs to be much
>>> more careful about races between vmx_deliver_posted_interrupt and
>>> vcpu_enter_guest. First, the IPI for posted interrupts may be issued
>>> between setting vcpu->mode = IN_GUEST_MODE and disabling interrupts.
>>> If that happens, kvm_trigger_posted_interrupt returns true, but
>>> smp_kvm_posted_intr_ipi doesn't do anything about it. The guest is
>>> entered with PIR.ON, but the posted interrupt IPI has not been sent
>>> and the interrupt is only delivered to the guest on the next vmentry
>>> (if any). To fix this, disable interrupts before setting vcpu->mode.
>>> This ensures that the IPI is delayed until the guest enters non-root mode;
>>> it is then trapped by the processor causing the interrupt to be injected.
>>>
>>> Second, the IPI may be issued between
>>>
>>>         kvm_x86_ops->hwapic_irr_update(vcpu,
>>>                 kvm_lapic_find_highest_irr(vcpu));
>>>
>>> and vcpu->mode = IN_GUEST_MODE. In this case, kvm_vcpu_kick is called
>>> but it (correctly) doesn't do anything because it sees vcpu->mode ==
>>> OUTSIDE_GUEST_MODE. Again, the guest is entered with PIR.ON but no
>>> posted interrupt IPI is pending; this time, the fix for this is to move
>>> the RVI update after IN_GUEST_MODE.
>>>
>>> Both issues were previously masked by the liberal usage of KVM_REQ_EVENT.
>>> In both race scenarios KVM_REQ_EVENT would cancel guest entry, resulting
>>> in another vmentry which would inject the interrupt.
>>>
>>> This saves about 300 cycles on the self_ipi_* tests of vmexit.flat.
>>>
>>> Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
>>> ---
>>>  arch/x86/kvm/lapic.c | 11 ++++-------
>>>  arch/x86/kvm/vmx.c   |  8 +++++---
>>>  arch/x86/kvm/x86.c   | 44 +++++++++++++++++++++++++-------------------
>>>  3 files changed, 34 insertions(+), 29 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>>> index f644dd1dbe71..5ea94b622e88 100644
>>> --- a/arch/x86/kvm/lapic.c
>>> +++ b/arch/x86/kvm/lapic.c
>>> @@ -385,12 +385,8 @@ int __kvm_apic_update_irr(u32 *pir, void *regs)
>>>  int kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir)
>>>  {
>>>         struct kvm_lapic *apic = vcpu->arch.apic;
>>> -       int max_irr;
>>>
>>> -       max_irr = __kvm_apic_update_irr(pir, apic->regs);
>>> -
>>> -       kvm_make_request(KVM_REQ_EVENT, vcpu);
>>> -       return max_irr;
>>> +       return __kvm_apic_update_irr(pir, apic->regs);
>>>  }
>>>  EXPORT_SYMBOL_GPL(kvm_apic_update_irr);
>>>
>>> @@ -423,9 +419,10 @@ static inline void apic_clear_irr(int vec, struct kvm_lapic *apic)
>>>         vcpu = apic->vcpu;
>>>
>>>         if (unlikely(vcpu->arch.apicv_active)) {
>>> -               /* try to update RVI */
>>> +               /* need to update RVI */
>>>                 apic_clear_vector(vec, apic->regs + APIC_IRR);
>>> -               kvm_make_request(KVM_REQ_EVENT, vcpu);
>>> +               kvm_x86_ops->hwapic_irr_update(vcpu,
>>> +                               apic_find_highest_irr(apic));
>>>         } else {
>>>                 apic->irr_pending = false;
>>>                 apic_clear_vector(vec, apic->regs + APIC_IRR);
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index 27e40b180242..3dd4fad35a3e 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -5062,9 +5062,11 @@ static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
>>>         if (pi_test_and_set_pir(vector, &vmx->pi_desc))
>>>                 return;
>>>
>>> -       r = pi_test_and_set_on(&vmx->pi_desc);
>>> -       kvm_make_request(KVM_REQ_EVENT, vcpu);
>>> -       if (r || !kvm_vcpu_trigger_posted_interrupt(vcpu))
>>> +       /* If a previous notification has sent the IPI, nothing to do. */
>>> +       if (pi_test_and_set_on(&vmx->pi_desc))
>>> +               return;
>>> +
>>> +       if (!kvm_vcpu_trigger_posted_interrupt(vcpu))
>>>                 kvm_vcpu_kick(vcpu);
>>>  }
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index c666414adc1d..725473ba6dd3 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -6710,19 +6710,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>>>                         kvm_hv_process_stimers(vcpu);
>>>         }
>>>
>>> -       /*
>>> -        * KVM_REQ_EVENT is not set when posted interrupts are set by
>>> -        * VT-d hardware, so we have to update RVI unconditionally.
>>> -        */
>>> -       if (kvm_lapic_enabled(vcpu)) {
>>> -               /*
>>> -                * Update architecture specific hints for APIC
>>> -                * virtual interrupt delivery.
>>> -                */
>>> -               if (kvm_x86_ops->sync_pir_to_irr)
>>> -                       kvm_x86_ops->sync_pir_to_irr(vcpu);
>>> -       }
>>> -
>>>         if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
>>>                 ++vcpu->stat.req_event;
>>>                 kvm_apic_accept_events(vcpu);
>>> @@ -6767,20 +6754,39 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>>>         kvm_x86_ops->prepare_guest_switch(vcpu);
>>>         if (vcpu->fpu_active)
>>>                 kvm_load_guest_fpu(vcpu);
>>> +
>>> +       /*
>>> +        * Disabling IRQs before setting IN_GUEST_MODE. Posted interrupt
>>> +        * IPI are then delayed after guest entry, which ensures that they
>>> +        * result in virtual interrupt delivery.
>>> +        */
>>> +       local_irq_disable();
>>>         vcpu->mode = IN_GUEST_MODE;
>>>
>>>         srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
>>>
>>>         /*
>>> -        * We should set ->mode before check ->requests,
>>> -        * Please see the comment in kvm_make_all_cpus_request.
>>> -        * This also orders the write to mode from any reads
>>> -        * to the page tables done while the VCPU is running.
>>> -        * Please see the comment in kvm_flush_remote_tlbs.
>>> +        * 1) We should set ->mode before checking ->requests. Please see
>>> +        * the comment in kvm_make_all_cpus_request.
>>> +        *
>>> +        * 2) For APICv, we should set ->mode before checking PIR.ON. This
>>> +        * pairs with the memory barrier implicit in pi_test_and_set_on
>>> +        * (see vmx_deliver_posted_interrupt).
>>> +        *
>>> +        * 3) This also orders the write to mode from any reads to the page
>>> +        * tables done while the VCPU is running. Please see the comment
>>> +        * in kvm_flush_remote_tlbs.
>>>          */
>>>         smp_mb__after_srcu_read_unlock();
>>>
>>> -       local_irq_disable();
>>
>> The local_irq_disable() movement is unnecessary if you move sync_pir_to_irr.
>
> In addition, this movement will increase the time of irq disable to
> some degree. Do you think I can send a patch to revert it?

The difference is a few dozen clock cycles; I don't think it matters.
Also, a posted interrupt sent to the host while IN_GUEST_MODE is more
expensive than one sent while the processor is in non-root mode.

All in all, I think it's preferable to keep the local_irq_disable here.
Your observation seems correct though.

Paolo

> Regards,
> Wanpeng Li
>
>>
>> - IPI after vcpu->mode = IN_GUEST_MODE and interrupt disable, the PI is
>> delivered successfully.
>> - IPI between vcpu->mode = IN_GUEST_MODE and interrupt disable, the
>> sync_pir_to_irr will catch the PIR and set RVI.
>>
>> Regards,
>> Wanpeng Li
>>
>>> +       if (kvm_lapic_enabled(vcpu)) {
>>> +               /*
>>> +                * This handles the case where a posted interrupt was
>>> +                * notified with kvm_vcpu_kick.
>>> +                */
>>> +               if (kvm_x86_ops->sync_pir_to_irr)
>>> +                       kvm_x86_ops->sync_pir_to_irr(vcpu);
>>> +       }
>>>
>>>         if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
>>>             || need_resched() || signal_pending(current)) {
>>> --
>>> 1.8.3.1
>>>
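
For readers following the ordering argument, here is a rough userspace sketch of the sender/receiver pairing the patch relies on. It is not kernel code: every toy_* name is hypothetical, the PIR is shrunk to 32 bits, C11 atomics stand in for the kernel bit operations and barriers, and counters replace the real IPI and kick. It only models the logic "set the PIR bit, test-and-set ON, then look at the mode" on the sender side and "set the mode, then check ON" on the entry side.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for vcpu->mode values. */
enum toy_mode { TOY_OUTSIDE_GUEST_MODE, TOY_IN_GUEST_MODE };

struct toy_vcpu {
    atomic_int  mode;   /* models vcpu->mode */
    atomic_uint pir;    /* models the PIR, 32 vectors only */
    atomic_bool on;     /* models PIR.ON */
    int ipis;           /* notifications that would be posted-interrupt IPIs */
    int kicks;          /* notifications that would go through the kick path */
};

static void toy_vcpu_init(struct toy_vcpu *v)
{
    atomic_init(&v->mode, TOY_OUTSIDE_GUEST_MODE);
    atomic_init(&v->pir, 0u);
    atomic_init(&v->on, false);
    v->ipis = 0;
    v->kicks = 0;
}

/* Sender side, shaped like the patched vmx_deliver_posted_interrupt. */
static void toy_deliver(struct toy_vcpu *v, unsigned int vec)
{
    unsigned int bit = 1u << vec;   /* vec must be below 32 in this toy */

    /* Vector already pending: nothing to do. */
    if (atomic_fetch_or(&v->pir, bit) & bit)
        return;

    /* ON already set: an earlier sender has notified the vCPU. */
    if (atomic_exchange(&v->on, true))
        return;

    /* The read of mode comes after the (sequentially consistent) ON update. */
    if (atomic_load(&v->mode) == TOY_IN_GUEST_MODE)
        v->ipis++;
    else
        v->kicks++;
}

/*
 * Entry side: publish the mode first, then check ON and drain the PIR.
 * Returns the bitmap of vectors that were pending.
 */
static unsigned int toy_enter_guest(struct toy_vcpu *v)
{
    atomic_store(&v->mode, TOY_IN_GUEST_MODE);
    if (!atomic_exchange(&v->on, false))
        return 0;
    return atomic_exchange(&v->pir, 0u);
}

/* Two sends before entry: one kick, no IPI, both vectors seen at entry. */
static unsigned int toy_scenario(struct toy_vcpu *v)
{
    toy_vcpu_init(v);
    toy_deliver(v, 3);
    toy_deliver(v, 5);
    return toy_enter_guest(v);
}
```

Calling toy_scenario() from a small main() shows the case discussed above: an interrupt notified through the kick path before entry is still picked up, because the entry side checks ON only after it has set the mode.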