This is embarrassing, but we have another Windows/Hyper-V issue to work
around in KVM (or QEMU). Hopefully the "RFC" tag makes it less offensive.

It was noticed that a Hyper-V guest on a q35 KVM/QEMU VM hangs on boot
when e.g. a 'piix4-usb-uhci' device is attached. The problem with this
device is that it uses level-triggered interrupts.

The 'hang' scenario develops like this:

1) Hyper-V boots and QEMU tries to inject two irqs simultaneously. One
of them is level-triggered. KVM injects the edge-triggered one and
requests an immediate exit to inject the level-triggered one:

 kvm_set_irq: gsi 23 level 1 source 0
 kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
 kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
 kvm_msi_set_irq: dst 0 vec 96 (Fixed|physical|edge)
 kvm_apic_accept_irq: apicid 0 vec 96 (Fixed|edge)
 kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000060 int_info_err 0

2) Hyper-V requires one of its own VMs to run to handle the situation,
but the immediate exit happens first:

 kvm_entry: vcpu 0
 kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
 kvm_entry: vcpu 0
 kvm_exit: reason PREEMPTION_TIMER rip 0xfffff8022f3d8350 info 0 0
 kvm_nested_vmexit: rip fffff8022f3d8350 reason PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
 kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0

3) Hyper-V doesn't want to deal with the second irq yet (its VM still
hasn't processed the first one), so it just does an 'EOI' for the
level-triggered interrupt, and this causes an immediate re-injection:

 kvm_exit: reason EOI_INDUCED rip 0xfffff80006a17e1a info 50 0
 kvm_eoi: apicid 0 vector 80
 kvm_userspace_exit: reason KVM_EXIT_IOAPIC_EOI (26)
 kvm_set_irq: gsi 23 level 1 source 0
 kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
 kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
 kvm_entry: vcpu 0

4) As we arm the preemption timer again, we get stuck in an infinite
loop:

 kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
 kvm_entry: vcpu 0
 kvm_exit: reason PREEMPTION_TIMER rip 0xfffff8022f3d8350 info 0 0
 kvm_nested_vmexit: rip fffff8022f3d8350 reason PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
 kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0
 kvm_entry: vcpu 0
 kvm_exit: reason EOI_INDUCED rip 0xfffff80006a17e1a info 50 0
 kvm_eoi: apicid 0 vector 80
 kvm_userspace_exit: reason KVM_EXIT_IOAPIC_EOI (26)
 kvm_set_irq: gsi 23 level 1 source 0
 kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
 kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
 kvm_entry: vcpu 0
 kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
 ...

How does this work on real hardware? I don't really know, but it seems
we have dealt with similar issues before; see commit 184564efae4d7
("kvm: ioapic: conditionally delay irq delivery during eoi broadcast").
If EOI doesn't always cause an *immediate* interrupt re-injection, some
progress can be made.

An alternative approach would be to port the above-mentioned commit to
QEMU's ioapic implementation; a rough sketch of that idea follows below.
I'm, however, not sure that level-triggered interrupts are a must to
trigger the issue.

HV_TIMER_THROTTLE/HV_TIMER_DELAY are more or less arbitrary.

I haven't looked at SVM yet, but its immediate exit implementation does
smp_send_reschedule(); I'm not sure whether this can cause the lockup
described above.
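To make the alternative concrete, here is a minimal userspace sketch of
the delayed re-injection idea (if I remember the kernel commit correctly,
it hangs re-injection off a delayed work item and uses a much larger
threshold; all names here -- ioapic_state, EOI_STORM_THRESHOLD,
schedule_delayed_reinject -- are invented for illustration, this is not
actual QEMU code):

/*
 * Rough sketch of "conditionally delay irq delivery during eoi
 * broadcast" as it could look in a userspace ioapic model.
 */
#include <stdint.h>
#include <stdio.h>

#define IOAPIC_NUM_PINS      24
#define EOI_STORM_THRESHOLD 100	/* arbitrary for the sketch */

struct ioapic_state {
	uint32_t irr;			/* level lines still asserted */
	int reinject_cnt[IOAPIC_NUM_PINS]; /* successive re-injections */
};

/* In real code this would arm a timer whose callback re-injects the
 * interrupt later, giving the vCPU a chance to make progress. */
static void schedule_delayed_reinject(struct ioapic_state *s, int pin)
{
	printf("pin %d: EOI storm detected, delaying re-injection\n", pin);
	s->reinject_cnt[pin] = 0;
}

static void reinject(struct ioapic_state *s, int pin)
{
	/* Deliver the interrupt to the guest's LAPIC again; omitted. */
	(void)s;
	(void)pin;
}

/* Called on EOI broadcast for a level-triggered interrupt. */
static void ioapic_eoi(struct ioapic_state *s, int pin)
{
	if (!(s->irr & (1u << pin))) {
		s->reinject_cnt[pin] = 0; /* deasserted: guest handled it */
		return;
	}

	/* Line still asserted: a naive implementation re-injects
	 * immediately, which is exactly the livelock seen above. */
	if (++s->reinject_cnt[pin] > EOI_STORM_THRESHOLD)
		schedule_delayed_reinject(s, pin);
	else
		reinject(s, pin);
}

int main(void)
{
	struct ioapic_state s = { .irr = 1u << 23 }; /* gsi 23 stuck high */

	for (int i = 0; i < EOI_STORM_THRESHOLD + 2; i++)
		ioapic_eoi(&s, 23);	/* guest EOIs in a tight loop */
	return 0;
}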
Signed-off-by: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
---
 arch/x86/kvm/vmx/vmcs.h |  2 ++
 arch/x86/kvm/vmx/vmx.c  | 24 +++++++++++++++++++++++-
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index cb6079f8a227..983dfc60c30c 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -54,6 +54,8 @@ struct loaded_vmcs {
 	bool launched;
 	bool nmi_known_unmasked;
 	bool hv_timer_armed;
+	unsigned long hv_timer_lastrip;
+	int hv_timer_lastrip_cnt;
 	/* Support for vnmi-less CPUs */
 	int soft_vnmi_blocked;
 	ktime_t entry_time;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 617443c1c6d7..8a49ec14dd3a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6321,14 +6321,36 @@ static void vmx_arm_hv_timer(struct vcpu_vmx *vmx, u32 val)
 	vmx->loaded_vmcs->hv_timer_armed = true;
 }
 
+#define HV_TIMER_THROTTLE 100
+#define HV_TIMER_DELAY 1000
+
 static void vmx_update_hv_timer(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u64 tscl;
 	u32 delta_tsc;
 
+	/*
+	 * Some guests are buggy and may behave in a way that causes KVM to
+	 * always request immediate exit (e.g. of a nested guest). At the same
+	 * time guest progress may be required to resolve the condition.
+	 * Throttling below makes sure we never get stuck completely.
+	 */
 	if (vmx->req_immediate_exit) {
-		vmx_arm_hv_timer(vmx, 0);
+		unsigned long rip = kvm_rip_read(vcpu);
+
+		if (vmx->loaded_vmcs->hv_timer_lastrip == rip) {
+			++vmx->loaded_vmcs->hv_timer_lastrip_cnt;
+		} else {
+			vmx->loaded_vmcs->hv_timer_lastrip_cnt = 0;
+			vmx->loaded_vmcs->hv_timer_lastrip = rip;
+		}
+
+		if (vmx->loaded_vmcs->hv_timer_lastrip_cnt > HV_TIMER_THROTTLE)
+			vmx_arm_hv_timer(vmx, HV_TIMER_DELAY);
+		else
+			vmx_arm_hv_timer(vmx, 0);
+
 		return;
 	}
-- 
2.20.1
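
P.S. For anyone who wants to poke at the throttling heuristic without a
nested Hyper-V setup, below is a standalone userspace extract of the
same-RIP counting logic from the patch (struct and helper names are
mine; only the logic mirrors vmx_update_hv_timer() above):

#include <stdio.h>

#define HV_TIMER_THROTTLE 100
#define HV_TIMER_DELAY    1000

struct throttle_state {
	unsigned long lastrip;	/* RIP seen on the previous immediate exit */
	int lastrip_cnt;	/* consecutive immediate exits at that RIP */
};

/* Returns the value to program into the preemption timer: 0 means
 * "exit immediately", HV_TIMER_DELAY means "let the guest run a bit". */
static unsigned int hv_timer_value(struct throttle_state *t,
				   unsigned long rip)
{
	if (t->lastrip == rip) {
		++t->lastrip_cnt;
	} else {
		t->lastrip_cnt = 0;
		t->lastrip = rip;
	}

	return t->lastrip_cnt > HV_TIMER_THROTTLE ? HV_TIMER_DELAY : 0;
}

int main(void)
{
	struct throttle_state t = { 0, 0 };

	/* Simulate a vCPU stuck at the RIP from the trace above: the
	 * throttle engages only after HV_TIMER_THROTTLE repeats. */
	for (int i = 0; i < HV_TIMER_THROTTLE + 3; i++) {
		unsigned int val = hv_timer_value(&t, 0xfffff80006a40115UL);

		if (val)
			printf("iteration %d: throttled, timer = %u\n", i, val);
	}
	return 0;
}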