2015-03-27 13:16+0300, Andrey Korolyov:
> On Fri, Mar 27, 2015 at 12:03 AM, Bandan Das <bsd@xxxxxxxxxx> wrote:
> > Radim Krčmář <rkrcmar@xxxxxxxxxx> writes:
> >> I second Bandan -- checking that it reproduces on other machine would be
> >> great for sanity :) (Although a bug in our APICv is far more likely.)
> >
> > If it's APICv related, a run without apicv enabled could give more hints.
> >
> > Your "devices not getting reset" hypothesis makes the most sense to me,
> > maybe the timer vector in the error message is just one part of
> > the whole story. Another misbehaving interrupt from the dark comes in at the
> > same time and leads to a double fault.
>
> Default trace (APICv enabled, first reboot introduced the issue):
> http://xdel.ru/downloads/kvm-e5v2-issue/hanged-reboot-apic-on.dat.gz

The relevant part is here,
prefixed with "qemu-system-x86-4180 [002] 697.111550:"

  kvm_exit: reason CR_ACCESS rip 0xd272 info 0 0
  kvm_cr: cr_write 0 = 0x10
  kvm_mmu_get_page: existing sp gfn 0 0/4 q0 direct --- !pge !nxe root 0 sync
  kvm_entry: vcpu 0
  kvm_emulate_insn: f0000:d275: ea 7a d2 00 f0
  kvm_emulate_insn: f0000:d27a: 2e 0f 01 1e f0 6c
  kvm_emulate_insn: f0000:d280: 31 c0
  kvm_emulate_insn: f0000:d282: 8e e0
  kvm_emulate_insn: f0000:d284: 8e e8
  kvm_emulate_insn: f0000:d286: 8e c0
  kvm_emulate_insn: f0000:d288: 8e d8
  kvm_emulate_insn: f0000:d28a: 8e d0
  kvm_entry: vcpu 0
  kvm_exit: reason EXTERNAL_INTERRUPT rip 0xd28f info 0 800000f6
  kvm_entry: vcpu 0
  kvm_exit: reason EPT_VIOLATION rip 0x8dd0 info 184 0
  kvm_page_fault: address f8dd0 error_code 184
  kvm_entry: vcpu 0
  kvm_exit: reason EXTERNAL_INTERRUPT rip 0x8dd0 info 0 800000f6
  kvm_entry: vcpu 0
  kvm_exit: reason EPT_VIOLATION rip 0x76d6 info 184 0
  kvm_page_fault: address f76d6 error_code 184
  kvm_entry: vcpu 0
  kvm_exit: reason EXTERNAL_INTERRUPT rip 0x76d6 info 0 800000f6
  kvm_entry: vcpu 0
  kvm_exit: reason PENDING_INTERRUPT rip 0xd331 info 0 0
  kvm_inj_virq: irq 8
  kvm_entry: vcpu 0
  kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfea5 info 0 800000f6
  kvm_entry: vcpu 0
  kvm_exit: reason EPT_VIOLATION rip 0xfea5 info 184 0
  kvm_page_fault: address ffea5 error_code 184
  kvm_entry: vcpu 0
  kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfea5 info 0 800000f6
  kvm_entry: vcpu 0
  kvm_exit: reason EPT_VIOLATION rip 0xe990 info 184 0
  kvm_page_fault: address fe990 error_code 184
  kvm_entry: vcpu 0
  kvm_exit: reason EXTERNAL_INTERRUPT rip 0xe990 info 0 800000f6
  kvm_entry: vcpu 0
  kvm_exit: reason EXCEPTION_NMI rip 0xd334 info 0 80000b0d
  kvm_userspace_exit: reason KVM_EXIT_INTERNAL_ERROR (17)

> Trace without APICv (three reboots, just to make sure to hit the
> problematic condition of supposed DF, as it still have not one hundred
> percent reproducibility):
> http://xdel.ru/downloads/kvm-e5v2-issue/apic-off.dat.gz

The trace here contains a well matching excerpt, just instead of the
EXCEPTION_NMI, it does

  169.905098: kvm_exit: reason EPT_VIOLATION rip 0xd334 info 181 0
  169.905102: kvm_page_fault: address feffd066 error_code 181

and works.

Page fault says we tried to read 0xfeffd066 -- probably IOPB of TSS.
(I guess it is pre-fetch for following IO instruction.)

Nothing strikes me when looking at it, but some APICv boots don't fail,
so it would be interesting to compare them ... host's 0xf6 interrupt
(IRQ_WORK_VECTOR) is a possible source of races. (We could look more
closely. It is fired too often for my liking as well.)
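
For anyone comparing the two traces by hand, here is a rough sketch of a
decoder for the "info" values on the kvm_exit lines. It is not part of the
original analysis. It assumes that on VMX the first info field is the exit
qualification and the second is the VM-exit interruption information, and the
bit layouts are my reading of the Intel SDM. The function and key names are
made up for illustration.

```python
# Sketch only: bit layouts are assumed from the Intel SDM, names are hypothetical.

# VM-exit interruption information, bits 10:8 (only the types I am sure of).
INTR_TYPES = {
    0: "external interrupt",
    2: "NMI",
    3: "hardware exception",
    6: "software exception",
}


def decode_intr_info(value):
    """Split a VM-exit interruption-information word into its fields."""
    return {
        "valid": bool((value >> 31) & 1),           # bit 31
        "vector": value & 0xFF,                     # bits 7:0
        "type": INTR_TYPES.get((value >> 8) & 0x7, "other"),
        "has_error_code": bool((value >> 11) & 1),  # bit 11
    }


def decode_ept_qualification(value):
    """Split an EPT-violation exit qualification into its low fields."""
    access = [name
              for bit, name in ((0, "read"), (1, "write"), (2, "fetch"))
              if (value >> bit) & 1]
    return {
        "access": access,                             # bits 2:0
        "ept_perms": (value >> 3) & 0x7,              # bits 5:3, R/W/X of the entry
        "gla_valid": bool((value >> 7) & 1),          # bit 7
        "final_translation": bool((value >> 8) & 1),  # bit 8
    }


if __name__ == "__main__":
    # Values copied from the traces quoted above.
    for word in (0x800000F6, 0x80000B0D):
        print(hex(word), decode_intr_info(word))
    for qual in (0x184, 0x181):
        print(hex(qual), decode_ept_qualification(qual))
```

Under those assumptions, 80000b0d comes out as a valid hardware exception
with vector 13 and an error code, 800000f6 as external interrupt vector 0xf6,
and 184 versus 181 as an instruction fetch versus a data read.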