Re: [Qemu-devel] E5-2620v2 - emulation stop error

Andrey Korolyov <andrey@xxxxxxx> · Tue, 10 Mar 2015 21:08:49 +0300



On Tue, Mar 10, 2015 at 7:57 PM, Dr. David Alan Gilbert
<dgilbert@xxxxxxxxxx> wrote:
> * Andrey Korolyov (andrey@xxxxxxx) wrote:
>> On Sat, Mar 7, 2015 at 3:00 AM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>> > On Fri, Mar 6, 2015 at 7:57 PM, Bandan Das <bsd@xxxxxxxxxx> wrote:
>> >> Andrey Korolyov <andrey@xxxxxxx> writes:
>> >>
>> >>> On Fri, Mar 6, 2015 at 1:14 AM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>> >>>> Hello,
>> >>>>
>> >>>> recently I've got a couple of shiny new Intel 2620v2s for future
>> >>>> replacement of the E5-2620v1, but I have experienced relatively many
>> >>>> emulation errors; all traces look similar to the one below. I am
>> >>>> running qemu-2.1 on x86 on top of the 3.10 branch for testing purposes
>> >>>> but can switch to other versions if necessary. Most of the crashes
>> >>>> happened during a reboot cycle or at the end of an ACPI-based shutdown,
>> >>>> if this helps. I have no clue what could introduce such a mess within
>> >>>> the same processor family using identical software, as the 2620v1 has
>> >>>> never had a similar problem. Please let me know if there are any
>> >>>> additional measures that would make the picture clearer.
>> >>>>
>> >>>> Thanks!
>> >>>>
>> >>>> KVM internal error. Suberror: 2
>> >>>> extra data[0]: 800000d1
>> >>>> extra data[1]: 80000b0d
>> >>>> EAX=00000003 EBX=00000000 ECX=00000000 EDX=00000000
>> >>>> ESI=00000000 EDI=00000000 EBP=00000000 ESP=00006cd4
>> >>>> EIP=0000d3f9 EFL=00010202 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
>> >>>> ES =0000 00000000 0000ffff 00009300
>> >>>> CS =f000 000f0000 0000ffff 00009b00
>> >>>> SS =0000 00000000 0000ffff 00009300
>> >>>> DS =0000 00000000 0000ffff 00009300
>> >>>> FS =0000 00000000 0000ffff 00009300
>> >>>> GS =0000 00000000 0000ffff 00009300
>> >>>> LDT=0000 00000000 0000ffff 00008200
>> >>>> TR =0000 00000000 0000ffff 00008b00
>> >>>> GDT=     000f6e98 00000037
>> >>>> IDT=     00000000 000003ff
>> >>>> CR0=00000010 CR2=00000000 CR3=00000000 CR4=00000000
>> >>>> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000
>> >>>> DR3=0000000000000000
>> >>>> DR6=00000000ffff0ff0 DR7=0000000000000400
>> >>>> EFER=0000000000000000
>> >>>> Code=48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb <cd>
>> >>>> 10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66
>> >>>> b8 00 e0 00 00 8e
>> >>>
>> >>>
>> >>> It turns out that those errors are introduced by APICv, which gets
>> >>> enabled due to the different feature set. If anyone is interested in
>> >>> reproducing/fixing this exactly on 3.10, it takes about one hundred
>> >>> migrations/power state changes for the issue to appear; the guest OS
>> >>> can be Linux or Windows.
>> >>
>> >> Are you able to reproduce this on a more recent upstream kernel as well?
>> >>
>> >> Bandan
>> >
>> > I'll go through a test cycle with 3.18 and 2603v2 around tomorrow and
>> > follow up with any reproducible results.
>>
>> Heh.. the issue is not triggered on the 2603v2 at all; at least I am not
>> able to hit it. The only difference from the 2620v2, apart from the lower
>> frequency, is the Intel Dynamic Acceleration feature. I'd appreciate any
>> testing with higher CPU models with the same or a richer feature set. The
>> testing itself can be done on either generic 3.10 or RH7 kernels, as both
>> of them experience this issue. I conducted all tests with C-states
>> disabled, so I advise doing the same as a first reproduction step.
>>
>> Thanks!
>>
>> model name      : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
>> stepping        : 4
>> microcode       : 0x416
>> cpu MHz         : 2100.039
>> cache size      : 15360 KB
>> siblings        : 12
>> apicid          : 43
>> initial apicid  : 43
>> fpu             : yes
>> fpu_exception   : yes
>> cpuid level     : 13
>> wp              : yes
>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
>> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
>> syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts
>> rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq
>> dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca
>> sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
>> rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi
>> flexpriority ept vpid fsgsbase smep erms
>
> I'm seeing something similar; it's very intermittent and generally
> happens right at boot of the guest. I'm running this on qemu
> head+my postcopy world (but it's happening right at boot before postcopy
> gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case,
> but hey, maybe I'm seeing a different bug.
>
> Dave

Yep, looks like we are hitting the same bug: two thirds of my failure
events occurred during a boot/reboot cycle and approx. one third
happened in the middle of runtime. Which CPU are you using, v0 or v2
(in other words, is APICv enabled)?
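
One quick way to answer the APICv question on a given host is to read the
kvm_intel module parameter from sysfs. The sketch below is only an
illustration: it assumes the parameter is exposed as
/sys/module/kvm_intel/parameters/enable_apicv, which may not be the name
used by every kernel version, and the helper name apicv_enabled is made up
for this example. A missing file is reported as "unknown", not as
"disabled".

```python
from pathlib import Path

# Assumed location of the kvm_intel module parameter. The parameter name
# may differ between kernel versions, so treat this path as an assumption.
APICV_PARAM = Path("/sys/module/kvm_intel/parameters/enable_apicv")


def apicv_enabled(param=APICV_PARAM):
    """Return True or False from the module parameter, or None if unreadable."""
    try:
        value = Path(param).read_text().strip()
    except OSError:
        # Module not loaded, parameter absent, or no permission to read it.
        return None
    # Boolean module parameters are normally shown as "Y" or "N".
    return value in ("Y", "1")


if __name__ == "__main__":
    labels = {True: "APICv enabled", False: "APICv disabled", None: "unknown"}
    print(labels[apicv_enabled()])
```

If the parameter reads as enabled, reloading kvm_intel with it turned off
and repeating the reboot/migration cycle would be one way to check whether
the failures really follow APICv.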