> On 29 Mar 2019, at 12:14, Vitaly Kuznetsov <vkuznets@xxxxxxxxxx> wrote:
>
> Liran Alon <liran.alon@xxxxxxxxxx> writes:
>
>>> On 28 Mar 2019, at 22:31, Vitaly Kuznetsov <vkuznets@xxxxxxxxxx> wrote:
>>>
>>> This is embarrassing but we have another Windows/Hyper-V issue to work
>>> around in KVM (or QEMU). Hope "RFC" makes it less offensive.
>>>
>>> It was noticed that a Hyper-V guest on a q35 KVM/QEMU VM hangs on boot
>>> if e.g. a 'piix4-usb-uhci' device is attached. The problem with this
>>> device is that it uses level-triggered interrupts.
>>>
>>> The 'hang' scenario develops like this:
>>> 1) Hyper-V boots and QEMU is trying to inject two irqs simultaneously.
>>> One of them is level-triggered. KVM injects the edge-triggered one and
>>> requests an immediate exit to inject the level-triggered one:
>>>
>>> kvm_set_irq: gsi 23 level 1 source 0
>>> kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
>>> kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
>>> kvm_msi_set_irq: dst 0 vec 96 (Fixed|physical|edge)
>>> kvm_apic_accept_irq: apicid 0 vec 96 (Fixed|edge)
>>> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000060 int_info_err 0
>>
>> There is no immediate-exit here.
>> Here QEMU just set two pending irqs: vector 80 and vector 96.
>> Because vCPU 0 is running in non-root mode, KVM emulates an exit from
>> L2 to L1 on EXTERNAL_INTERRUPT.
>> Note that EXTERNAL_INTERRUPT is emulated on vector 0x60==96, which is
>> the highest pending vector. This is correct.
>>
>> BTW, I don't know why both are set in the LAPIC as edge-triggered and
>> not level-triggered.
>> But it can be seen from the trace pattern that these interrupts are
>> both level-triggered. (See QEMU's ioapic_service().)
>> How did you deduce that one is edge-triggered and the other is
>> level-triggered?
>
> The "kvm_apic_accept_irq" event misreports level-triggered interrupts
> as edge-triggered, see my "KVM: x86: avoid misreporting level-triggered
> irqs as edge-triggered in tracing" patch.
>
> Other than that I double-checked, both in QEMU and KVM (and there's a
> lot of additional debug prints stripped), and I'm certain there's no
> disagreement anywhere: gsi 23/vec 80 is a level-triggered interrupt.
>
>>
>>>
>>> 2) Hyper-V requires one of its VMs to run to handle the situation but
>>> an immediate exit happens:
>>>
>>> kvm_entry: vcpu 0
>>> kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
>>> kvm_entry: vcpu 0
>>> kvm_exit: reason PREEMPTION_TIMER rip 0xfffff8022f3d8350 info 0 0
>>> kvm_nested_vmexit: rip fffff8022f3d8350 reason PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
>>> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0
>>
>> I assume that as part of Hyper-V's VMExit handler for
>> EXTERNAL_INTERRUPT, it will forward the interrupt to the host,
>> as done in KVM's vcpu_enter_guest() calling
>> kvm_x86_ops->handle_external_intr().
>> Because vmcs->vm_exit_intr_info specifies vector 96, we are still left
>> with vector 80 pending.
>>
>> I also assume that Hyper-V utilises VM_EXIT_ACK_INTR_ON_EXIT and thus
>> vector 96 is cleared from the LAPIC IRR and the bit in the LAPIC ISR
>> for vector 96 is set.
>> This is emulated by L0 KVM at nested_vmx_vmexit() ->
>> kvm_cpu_get_interrupt().
>>
>> I further assume that at the point that vector 96 runs in L1,
>> interrupts are disabled.
>> Afterwards I would expect L1 to enable interrupts (similar to
>> vcpu_enter_guest() calling local_irq_enable() after
>> kvm_x86_ops->handle_external_intr()).
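
To make the IRR/ISR bookkeeping I describe above concrete, here is a
minimal self-contained model (the bitmap helpers and all names are mine
for illustration, this is not the actual KVM code). It shows why the
emulated VMExit acknowledges vector 0x60==96 first and leaves vector
0x50==80 pending:

#include <stdio.h>

static unsigned char irr[32], isr[32];	/* 256 vectors, one bit each */

static void set_vec(unsigned char *bm, int v)   { bm[v / 8] |= 1u << (v % 8); }
static void clear_vec(unsigned char *bm, int v) { bm[v / 8] &= ~(1u << (v % 8)); }
static int  test_vec(unsigned char *bm, int v)  { return (bm[v / 8] >> (v % 8)) & 1; }

/* The LAPIC always serves the highest pending vector first. */
static int highest_irr(void)
{
	int v;

	for (v = 255; v >= 0; v--)
		if (test_vec(irr, v))
			return v;
	return -1;
}

/*
 * What the VM_EXIT_ACK_INTR_ON_EXIT emulation effectively does on the
 * emulated VMExit: move the served vector from IRR to ISR, where it
 * stays until EOI.
 */
static int ack_intr_on_exit(void)
{
	int v = highest_irr();

	if (v >= 0) {
		clear_vec(irr, v);
		set_vec(isr, v);
	}
	return v;
}

int main(void)
{
	set_vec(irr, 0x50);	/* the level-triggered irq, gsi 23 */
	set_vec(irr, 0x60);	/* the other pending irq (vector 96) */
	printf("first exit injects %#x\n", ack_intr_on_exit());	/* 0x60 */
	printf("second exit injects %#x\n", ack_intr_on_exit());	/* 0x50 */
	return 0;
}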
>>
>> I would expect the Hyper-V handler for vector 96 at some point to do
>> EOI such that when interrupts are later enabled, vector 80 will also
>> get injected.
>> All of this before attempting to resume back into L2.
>>
>> However, it can be seen that indeed at this resume, you receive, after
>> an immediate exit, an injection of EXTERNAL_INTERRUPT on vector
>> 0x50==80.
>> As if Hyper-V never enabled interrupts after handling vector 96 before
>> doing a resume again to L2.
>>
>> This is still valid of course but just a bit bizarre and
>> inefficient. Oh well. :)
>
> Reverse-engineering is always fun :-)

We can report this to Microsoft as well. :)

>
>>
>>>
>>> 3) Hyper-V doesn't want to deal with the second irq (as its VM still
>>> didn't process the first one)
>>
>> Both interrupts are for L1, not L2.
>>
>>> so it just does 'EOI' for the level-triggered interrupt and this
>>> causes re-injection:
>>>
>>> kvm_exit: reason EOI_INDUCED rip 0xfffff80006a17e1a info 50 0
>>> kvm_eoi: apicid 0 vector 80
>>> kvm_userspace_exit: reason KVM_EXIT_IOAPIC_EOI (26)
>>> kvm_set_irq: gsi 23 level 1 source 0
>>> kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
>>> kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
>>> kvm_entry: vcpu 0
>>
>> What happens here is that Hyper-V, as a response to the second
>> EXTERNAL_INTERRUPT on vector 80, invokes the vector 80 handler, which
>> performs an EOI; that EOI is configured in eoi_exit_bitmap to cause an
>> EOI_INDUCED exit to L0.
>> The EOI_INDUCED handler will reach handle_apic_eoi_induced() ->
>> kvm_apic_set_eoi_accelerated() -> kvm_ioapic_send_eoi() ->
>> kvm_make_request(KVM_REQ_IOAPIC_EOI_EXIT), which will cause the exit
>> on KVM_EXIT_IOAPIC_EOI to QEMU as required.
>>
>> As part of QEMU's handling for this exit (ioapic_eoi_broadcast()), it
>> will note that the pin's irr is still set (the irq-line was not
>> lowered by the vector 80 interrupt handler before EOI), and thus
>> vector 80 is re-injected by the IOAPIC at ioapic_service().
>>
>> If this is indeed a level-triggered interrupt, then it seems buggy to
>> me that the vector 80 handler hadn't lowered the irq-line before EOI.
>> I would suggest adding a trace to QEMU's ioapic_set_irq() for when
>> vector=80 and level=0, and to ioapic_eoi_broadcast(), to verify if
>> this is what happens.
>> Another option is that even though the vector 80 handler lowered the
>> irq-line, it got re-set by the device between when it lowered the
>> irq-line and when it did the EOI. Which is legit.
>>
>> If the former is what happens and the vector 80 handler hadn't lowered
>> the irq-line before EOI, this is the real root-cause of your issue at
>> hand here.
>
> Yes, exactly - and sorry if I wasn't clear about that. Hyper-V does EOI
> for a still-pending level-triggered interrupt (as it can't actually fix
> the reason - it needs the Windows partition - the L2 VM - to run and do
> the job).
>
> As pure speculation I can guess: they don't want to keep a queue of
> interrupts which need L2 handling and they count on the fact that doing
> EOI with interrupts disabled in L1 -> VMRESUME will actually allow the
> L2 VM to run for some time and make some progress.

It seems very unlikely to me that the IOAPIC will delay re-evaluation
of irq-lines after an EOI so much that the CPU will actually be able to
get back into VMX non-root mode and execute a bunch of code that will
eventually lower the irq-line.
If L1 requires L2 to handle the level-triggered interrupt, then L1
should not EOI the interrupt but instead just use vmcs12 to inject the
interrupt into L2 and allow the L2 interrupt handler to first lower the
irq-line and only then perform EOI.
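
In other words, the ordering inside the guest's level-triggered
interrupt handler is the whole story. A sketch of the two orderings
(device_quiesce() and apic_eoi() are hypothetical stand-ins for the
device-specific deassert and the LAPIC EOI write, not real APIs):

static void device_quiesce(void) { /* deassert the device's irq-line */ }
static void apic_eoi(void)       { /* write the LAPIC EOI register */ }

/*
 * Buggy: the line is still asserted when the EOI arrives, so the
 * IOAPIC's re-evaluation finds it high and re-injects the same vector
 * immediately - the loop seen in the traces above.
 */
static void buggy_level_isr(void)
{
	apic_eoi();
	device_quiesce();
}

/*
 * Correct: deassert first, so the re-evaluation triggered by the EOI
 * finds the line low and nothing is re-injected.
 */
static void correct_level_isr(void)
{
	device_quiesce();
	apic_eoi();
}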
Let's discuss this with Microsoft and see what they have to say :)
(Let's open a private email thread with them regarding this. Feel free
to Cc me.)

>
>>
>>>
>>> 4) As we arm the preemption timer again we get stuck in the infinite
>>> loop:
>>>
>>> kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
>>> kvm_entry: vcpu 0
>>> kvm_exit: reason PREEMPTION_TIMER rip 0xfffff8022f3d8350 info 0 0
>>> kvm_nested_vmexit: rip fffff8022f3d8350 reason PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
>>> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0
>>> kvm_entry: vcpu 0
>>> kvm_exit: reason EOI_INDUCED rip 0xfffff80006a17e1a info 50 0
>>> kvm_eoi: apicid 0 vector 80
>>> kvm_userspace_exit: reason KVM_EXIT_IOAPIC_EOI (26)
>>> kvm_set_irq: gsi 23 level 1 source 0
>>> kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
>>> kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
>>> kvm_entry: vcpu 0
>>> kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
>>> ...
>>>
>>> How does this work on hardware? I don't really know but it seems that
>>> we were dealing with similar issues before, see commit 184564efae4d7
>>> ("kvm: ioapic: conditionally delay irq delivery during eoi
>>> broadcast"). In case EOI doesn't always cause an *immediate* interrupt
>>> re-injection, some progress can be made.
>>
>> This commit tries to handle a bug exactly like the one I have
>> described above.
>> In case a level-triggered interrupt handler in the guest performs EOI
>> before it lowers the irq-line, the level-triggered interrupt will be
>> re-injected in a loop forever.
>> The reason this doesn't happen on real hardware is that there is a
>> very short delay between when an EOI is performed to the IOAPIC and
>> when the IOAPIC attempts to serve a new interrupt.
>>
>> To handle this, this commit delays the evaluation of the IOAPIC's new
>> pending interrupts when it receives an EOI for a level-triggered
>> interrupt.
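
Conceptually, that delayed re-evaluation looks roughly like this (a
sketch of the idea only - the names, threshold and structure are mine,
not the actual code of that commit):

#include <stdbool.h>

#define SUCCESSIVE_IRQ_MAX 128	/* made-up throttling threshold */

struct ioapic_pin {
	bool line_asserted;	/* current level of the irq-line */
	int eoi_count;		/* EOIs seen with the line still high */
};

static void deliver(struct ioapic_pin *pin) { (void)pin; /* inject via LAPIC */ }
static void start_eoi_timer(struct ioapic_pin *pin) { (void)pin; /* re-deliver after a short delay */ }

/*
 * On EOI for a level-triggered pin: re-inject immediately while the
 * line is still asserted, but after too many back-to-back EOIs fall
 * back to a timer so the guest gets a chance to make progress (and
 * eventually lower the line) in between.
 */
static void ioapic_eoi(struct ioapic_pin *pin)
{
	if (!pin->line_asserted) {
		pin->eoi_count = 0;	/* well-behaved handler */
		return;
	}

	if (++pin->eoi_count < SUCCESSIVE_IRQ_MAX)
		deliver(pin);
	else
		start_eoi_timer(pin);
}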

>>
>>>
>>> An alternative approach would be to port the above mentioned commit
>>> to QEMU's ioapic implementation. I'm, however, not sure that
>>> level-triggered interrupts are a must to trigger the issue.
>>
>> This seems like the right solution to me.
>> It seems to me that I have actually pinpointed it in this analysis
>> directly to level-triggered interrupts, and they are a must to trigger
>> the issue.
>
> Yes, in this particular case it is directly linked to (mis-)handling
> of a level-triggered interrupt.
>
>>
>> This also explains why the issue doesn't reproduce when you are using
>> the in-kernel irqchip.
>> It also explains why "piix4-usb-uhci" must be present to reproduce
>> this issue, as the bug is probably in that device's interrupt handler.
>
> Level-triggered interrupts are rather rare but still present, I think
> that any device using them can cause a similar hang.

Why? This kind of hang should occur only if the guest has a buggy
level-triggered interrupt handler.
If the handler is written properly such that it lowers the irq-line and
only then does EOI, there is no reason it will hang.
(If between the irq-line lowering and the EOI the line gets asserted
again, then it is legit that there will be another interrupt, as it
should be handled.)

-Liran

>
>> (We should probably report this bug to Microsoft. Do we have any
>> contacts for this?)
>
> Yes, I just wanted to get KVM people's opinion first.
>
>>
>> Do you agree with the above analysis?
>
> Thank you for it! I think we're in violent agreement here :-)
>
>> -Liran
>>
>>>
>>> HV_TIMER_THROTTLE/HV_TIMER_DELAY are more or less arbitrary. I
>>> haven't looked at SVM yet but their immediate exit implementation
>>> does smp_send_reschedule(), I'm not sure this can cause the above
>>> described lockup.
>>>
>>> Signed-off-by: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
>>
>> As mentioned above, I think this patch should be dropped.
>>
>
> Yes, of course, I never expected it to be merged. I am, however, afraid
> that we'll have to get back to this discussion if another scenario with
> an 'immediate eoi storm' appears. I can't think of anything in
> particular right away so I'm going to implement the throttling for
> level-triggered interrupt re-assertion in QEMU first.
>
>>> ---
>>>  arch/x86/kvm/vmx/vmcs.h |  2 ++
>>>  arch/x86/kvm/vmx/vmx.c  | 24 +++++++++++++++++++++++-
>>>  2 files changed, 25 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
>>> index cb6079f8a227..983dfc60c30c 100644
>>> --- a/arch/x86/kvm/vmx/vmcs.h
>>> +++ b/arch/x86/kvm/vmx/vmcs.h
>>> @@ -54,6 +54,8 @@ struct loaded_vmcs {
>>>  	bool launched;
>>>  	bool nmi_known_unmasked;
>>>  	bool hv_timer_armed;
>>> +	unsigned long hv_timer_lastrip;
>>> +	int hv_timer_lastrip_cnt;
>>>  	/* Support for vnmi-less CPUs */
>>>  	int soft_vnmi_blocked;
>>>  	ktime_t entry_time;
>>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>>> index 617443c1c6d7..8a49ec14dd3a 100644
>>> --- a/arch/x86/kvm/vmx/vmx.c
>>> +++ b/arch/x86/kvm/vmx/vmx.c
>>> @@ -6321,14 +6321,36 @@ static void vmx_arm_hv_timer(struct vcpu_vmx *vmx, u32 val)
>>>  	vmx->loaded_vmcs->hv_timer_armed = true;
>>>  }
>>>
>>> +#define HV_TIMER_THROTTLE 100
>>> +#define HV_TIMER_DELAY 1000
>>> +
>>>  static void vmx_update_hv_timer(struct kvm_vcpu *vcpu)
>>>  {
>>>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>>>  	u64 tscl;
>>>  	u32 delta_tsc;
>>>
>>> +	/*
>>> +	 * Some guests are buggy and may behave in a way that causes KVM to
>>> +	 * always request immediate exit (e.g. of a nested guest). At the same
>>> +	 * time guest progress may be required to resolve the condition.
>>> +	 * Throttling below makes sure we never get stuck completely.
>>> +	 */
>>>  	if (vmx->req_immediate_exit) {
>>> -		vmx_arm_hv_timer(vmx, 0);
>>> +		unsigned long rip = kvm_rip_read(vcpu);
>>> +
>>> +		if (vmx->loaded_vmcs->hv_timer_lastrip == rip) {
>>> +			++vmx->loaded_vmcs->hv_timer_lastrip_cnt;
>>> +		} else {
>>> +			vmx->loaded_vmcs->hv_timer_lastrip_cnt = 0;
>>> +			vmx->loaded_vmcs->hv_timer_lastrip = rip;
>>> +		}
>>> +
>>> +		if (vmx->loaded_vmcs->hv_timer_lastrip_cnt > HV_TIMER_THROTTLE)
>>> +			vmx_arm_hv_timer(vmx, HV_TIMER_DELAY);
>>> +		else
>>> +			vmx_arm_hv_timer(vmx, 0);
>>> +
>>>  		return;
>>>  	}
>>>
>>> --
>>> 2.20.1
>>>
>>
>
> --
> Vitaly