When the L1 IOAPIC redirection table is written, KVM_REQ_SCAN_IOAPIC is
requested on all vCPUs so that each vCPU recalculates its IOAPIC handled
vectors.

However, one of the vCPUs may currently be running L2. In that case,
vcpu_scan_ioapic() is called while is_guest_mode(vcpu) == true, and
load_eoi_exitmap() then writes to vmcs02->eoi_exit_bitmap. This is wrong
because vmcs02->eoi_exit_bitmap should always be equal to
vmcs12->eoi_exit_bitmap. Furthermore, KVM_REQ_SCAN_IOAPIC has already
been consumed at this point, so vmcs01->eoi_exit_bitmap is never
updated. This can cause remote_irr of some IOAPIC level-triggered entry
to remain set forever.

Fix this issue by delaying KVM_REQ_SCAN_IOAPIC processing so that it
executes only when running L1 (is_guest_mode(vcpu) == false).

The issue was reproduced with the following setup:
* L0 runs KVM with 64 CPUs
* L1 runs ESXi 6.0 with 8 CPUs
* ESXi runs 4 L2 VMs:
  1. Windows 8.1 32bit with 4 CPUs
  2. Ubuntu 17 Server with 4 CPUs
  3. Ubuntu Desktop with 2 CPUs
  4. CentOS 32bit with 1 CPU

A short while after booting all the L2 VMs, ESXi lost networking.
Examining the issue revealed that ESXi dynamically reconfigures the
IOAPIC redirection-table entry of the NIC, and that shortly afterwards
that entry's remote_irr remained set forever.

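The deferral described above can be sketched as a small standalone model. This is only an illustration under stated assumptions, not the kernel code: struct toy_vcpu and the toy_* functions are hypothetical stand-ins for the vCPU state, kvm_check_request(), vcpu_scan_ioapic() and the check in vcpu_enter_guest(), and a simple counter replaces the real rebuild of the handled-vector state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the vCPU state touched by the patch. */
struct toy_vcpu {
	bool guest_mode;          /* models is_guest_mode(vcpu): true while L2 runs */
	bool scan_requested;      /* models a pending KVM_REQ_SCAN_IOAPIC */
	bool scan_ioapic_pending; /* models the flag added by the patch */
	int scans_done;           /* counts rebuilds of the handled-vector state */
};

/* Models kvm_check_request(): report the request and consume it. */
bool toy_check_request(struct toy_vcpu *v)
{
	bool was_set = v->scan_requested;

	v->scan_requested = false;
	return was_set;
}

/* Models the patched vcpu_scan_ioapic(): defer while L2 is running. */
void toy_scan_ioapic(struct toy_vcpu *v)
{
	if (v->guest_mode) {
		v->scan_ioapic_pending = true;
		return;
	}
	v->scan_ioapic_pending = false;

	/* The rebuild must only ever happen while running L1. */
	assert(!v->guest_mode);
	v->scans_done++;
}

/* Models the patched request check in vcpu_enter_guest(). */
void toy_enter_guest(struct toy_vcpu *v)
{
	if (toy_check_request(v) ||
	    (!v->guest_mode && v->scan_ioapic_pending))
		toy_scan_ioapic(v);
}
```

In this model, a request that arrives while guest_mode is true is consumed but remembered in scan_ioapic_pending, and the first toy_enter_guest() call made after guest_mode becomes false performs the scan exactly once.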
Signed-off-by: Liran Alon <liran.alon@xxxxxxxxxx>
Reviewed-by: Arbel Moshe <arbel.moshe@xxxxxxxxxx>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@xxxxxxxxxx>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@xxxxxxxxxx>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@xxxxxxxxxx>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/x86.c              | 10 +++++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c73e493adf07..ceb8beb1bfc9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -498,6 +498,7 @@ struct kvm_vcpu_arch {
 	u64 apic_base;
 	struct kvm_lapic *apic;    /* kernel irqchip context */
 	bool apicv_active;
+	bool scan_ioapic_pending;
 	DECLARE_BITMAP(ioapic_handled_vectors, 256);
 	unsigned long apic_attention;
 	int32_t apic_arb_prio;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 03869eb7fcd6..ac1339148a9a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6720,6 +6720,12 @@ static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu)
 	if (!kvm_apic_hw_enabled(vcpu->arch.apic))
 		return;
 
+	if (is_guest_mode(vcpu)) {
+		vcpu->arch.scan_ioapic_pending = true;
+		return;
+	}
+	vcpu->arch.scan_ioapic_pending = false;
+
 	bitmap_zero(vcpu->arch.ioapic_handled_vectors, 256);
 
 	if (irqchip_split(vcpu->kvm))
@@ -6833,7 +6839,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 				goto out;
 			}
 		}
-		if (kvm_check_request(KVM_REQ_SCAN_IOAPIC, vcpu))
+		if (kvm_check_request(KVM_REQ_SCAN_IOAPIC, vcpu) ||
+		    (!is_guest_mode(vcpu) && vcpu->arch.scan_ioapic_pending))
 			vcpu_scan_ioapic(vcpu);
 		if (kvm_check_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu))
 			kvm_vcpu_reload_apic_access_page(vcpu);
@@ -7981,6 +7988,7 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 	kvm = vcpu->kvm;
 
 	vcpu->arch.apicv_active = kvm_x86_ops->get_enable_apicv(vcpu);
+	vcpu->arch.scan_ioapic_pending = false;
 	vcpu->arch.pv.pv_unhalted = false;
 	vcpu->arch.emulate_ctxt.ops = &emulate_ops;
 	if (!irqchip_in_kernel(kvm) || kvm_vcpu_is_reset_bsp(vcpu))
-- 
1.9.1