On Tue, Mar 10, 2015 at 7:57 PM, Dr. David Alan Gilbert <dgilbert@xxxxxxxxxx> wrote:
> * Andrey Korolyov (andrey@xxxxxxx) wrote:
>> On Sat, Mar 7, 2015 at 3:00 AM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>> > On Fri, Mar 6, 2015 at 7:57 PM, Bandan Das <bsd@xxxxxxxxxx> wrote:
>> >> Andrey Korolyov <andrey@xxxxxxx> writes:
>> >>
>> >>> On Fri, Mar 6, 2015 at 1:14 AM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>> >>>> Hello,
>> >>>>
>> >>>> recently I`ve got a couple of shiny new Intel 2620v2s for future
>> >>>> replacement of the E5-2620v1, but I experienced relatively many events
>> >>>> with emulation errors, all traces looks simular to the one below. I am
>> >>>> running qemu-2.1 on x86 on top of 3.10 branch for testing purposes but
>> >>>> can switch to some other versions if necessary. Most of crashes
>> >>>> happened during reboot cycle or at the end of ACPI-based shutdown
>> >>>> action, if this can help. I have zero clues of what can introduce such
>> >>>> a mess inside same processor family using identical software, as
>> >>>> 2620v1 has no simular problem ever. Please let me know if there can be
>> >>>> some side measures for making entire story more clear.
>> >>>>
>> >>>> Thanks!
>> >>>>
>> >>>> KVM internal error. Suberror: 2
>> >>>> extra data[0]: 800000d1
>> >>>> extra data[1]: 80000b0d
>> >>>> EAX=00000003 EBX=00000000 ECX=00000000 EDX=00000000
>> >>>> ESI=00000000 EDI=00000000 EBP=00000000 ESP=00006cd4
>> >>>> EIP=0000d3f9 EFL=00010202 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
>> >>>> ES =0000 00000000 0000ffff 00009300
>> >>>> CS =f000 000f0000 0000ffff 00009b00
>> >>>> SS =0000 00000000 0000ffff 00009300
>> >>>> DS =0000 00000000 0000ffff 00009300
>> >>>> FS =0000 00000000 0000ffff 00009300
>> >>>> GS =0000 00000000 0000ffff 00009300
>> >>>> LDT=0000 00000000 0000ffff 00008200
>> >>>> TR =0000 00000000 0000ffff 00008b00
>> >>>> GDT= 000f6e98 00000037
>> >>>> IDT= 00000000 000003ff
>> >>>> CR0=00000010 CR2=00000000 CR3=00000000 CR4=00000000
>> >>>> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000
>> >>>> DR3=0000000000000000
>> >>>> DR6=00000000ffff0ff0 DR7=0000000000000400
>> >>>> EFER=0000000000000000
>> >>>> Code=48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb <cd>
>> >>>> 10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66
>> >>>> b8 00 e0 00 00 8e
>> >>>
>> >>>
>> >>> It turns out that those errors are introduced by APICv, which gets
>> >>> enabled due to different feature set. If anyone is interested in
>> >>> reproducing/fixing this exactly on 3.10, it takes about one hundred of
>> >>> migrations/power state changes for an issue to appear, guest OS can be
>> >>> Linux or Win.
>> >>
>> >> Are you able to reproduce this on a more recent upstream kernel as well ?
>> >>
>> >> Bandan
>> >
>> > I`ll go through test cycle with 3.18 and 2603v2 around tomorrow and
>> > follow up with any reproduceable results.
>>
>> Heh.. issue is not triggered on 2603v2 at all, at least I am not able
>> to hit this. The only difference with 2620v2 except lower frequency is
>> an Intel Dynamic Acceleration feature. I`d appreciate any testing with
>> higher CPU models with same or richer feature set. The testing itself
>> can be done on both generic 3.10 or RH7 kernels, as both of them are
>> experiencing this issue. I conducted all tests with disabled cstates
>> so I advise to do the same for a first reproduction step.
>>
>> Thanks!
>>
>> model name : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
>> stepping : 4
>> microcode : 0x416
>> cpu MHz : 2100.039
>> cache size : 15360 KB
>> siblings : 12
>> apicid : 43
>> initial apicid : 43
>> fpu : yes
>> fpu_exception : yes
>> cpuid level : 13
>> wp : yes
>> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
>> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
>> syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts
>> rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq
>> dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca
>> sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
>> rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi
>> flexpriority ept vpid fsgsbase smep erms
>
> I'm seeing something similar; it's very intermittent and generally
> happening right at boot of the guest; I'm running this on qemu
> head+my postcopy world (but it's happening right at boot before postcopy
> gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
> but hey maybe I'm seeing a different bug.
>
> Dave

Yep, looks like we are hitting the same bug: two thirds of my failures
hit during the boot/reboot cycle and roughly one third happened in the
middle of runtime. Which CPU revision are you using, v0 or v2 (in other
words, is APICv enabled)?
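
For anyone who wants to answer the APICv question on their own host, here
is a minimal sketch. It assumes the kvm_intel module exposes the setting
as /sys/module/kvm_intel/parameters/enable_apicv; that path is an
assumption on my side, the parameter name may differ between kernel
versions, and the helper name apicv_state is made up for illustration.

```python
from pathlib import Path

# Assumed sysfs location of the kvm_intel parameter; the name may differ
# on some kernel versions, so treat this path as an assumption.
APICV_PARAM = Path("/sys/module/kvm_intel/parameters/enable_apicv")


def apicv_state(param_path=APICV_PARAM):
    """Return True or False for the parameter value, None if unreadable."""
    try:
        raw = Path(param_path).read_text().strip()
    except OSError:
        # Module not loaded, parameter named differently, or no permission.
        return None
    # Boolean module parameters are normally reported as "Y" or "N".
    return raw in ("Y", "y", "1")


if __name__ == "__main__":
    state = apicv_state()
    if state is None:
        print("enable_apicv not found (kvm_intel not loaded or other name)")
    else:
        print("APICv enabled" if state else "APICv disabled")
```

A None result only means the file could not be read at that path, not
that APICv is off.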