On Wed, Jan 26, 2011 at 10:52, Avi Kivity <avi@xxxxxxxxxx> wrote:
> On 01/25/2011 08:29 PM, Ruben Kerkhof wrote:
>>
>> > When you say "suddenly", this was with no changes to software and
>> > hardware?
>>
>> The host software and hardware hasn't changed in the two months since
>> the machine has been running. 2.6.34.7 kernel and qemu-kvm 0.13.
>>
>> We host customer vms on it though, so virtual machines come and go.
>> Various operating systems, a mixture of Linux, FreeBSD and Windows
>> 2008 R2. We have other machines with the same config without these
>> problems though.
>
> Are those other machines running a similar workload?

Yes, similar, or they're more heavily loaded. On this machine, about
half of the 48GB memory was used for virtual machines.

> The traces look awfully like bad hardware, though that can also be explained
> by random memory corruption due to a bug.

Yeah, that's what I'm expecting. We already replaced the memory, next
step is to move the disks over to another server to make sure it's not
the board or cpu's.

>> This time I have a few different messages though:
>>
>> 2011-01-25T11:58:50.001208+01:00 phy005 kernel: general protection fault:
>> 0000 [#1] SMP
>>
>> RSI: 0000000000000000 RDI: 1603a07305001568
>>
>> 2011-01-25T11:58:50.001486+01:00 phy005 kernel: Code: ff ff 41 8b 46
>> 08 41 29 06 4c 89 e7 57 9d 0f 1f 44 00 00 48 83 c4 18 5b 41 5c 41 5d
>> 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00 <f0> ff 4f 08 0f 94 c0 84
>> c0 74 10 85 f6 75 07 e8 63 fe ff ff eb
>
> lock decl 0x8(%rdi)
>
> %rdi is completely crap, looks like corruption again. Strangely, it is
> similar to the bad spte from the previous trace: 0x1603a0730500d277. The
> upper 48 bits are identical, the lower 16 bits are different.
>
>> 2011-01-25T12:06:32.673937+01:00 phy005 kernel: qemu-kvm: Corrupted
>> page table at address 7f37b37ff000
>> 2011-01-25T12:06:32.673959+01:00 phy005 kernel: PGD c201d1067 PUD
>> 94e538067 PMD 61e5bf067 PTE 1603a0730500e067
>
> Here are those magic 48 bits again, in the PTE entry.
>
>> 2011-01-25T12:38:49.416943+01:00 phy005 kernel: EPT: Misconfiguration.
>> 2011-01-25T12:38:49.417518+01:00 phy005 kernel: EPT: GPA: 0x2abff038
>> 2011-01-25T12:38:49.417526+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: spte 0x5f49e9007 level 4
>> 2011-01-25T12:38:49.417532+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: spte 0x5db595007 level 3
>> 2011-01-25T12:38:49.417553+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: spte 0x5d5da7007 level 2
>> 2011-01-25T12:38:49.417558+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: spte 0x1603a07305006277 level 1
>
> Again.
>
>> 2011-01-25T13:16:58.192440+01:00 phy005 kernel: BUG: Bad page map in
>> process qemu-kvm pte:1603a0730500d067 pmd:61059f067
>
> Again.
>
> However, these all came from a single boot, yes?

Correct.

> If so they can be the same
> corruption. Please collect more traces, with reboots in between.

Ok, thanks, will do.

Kind regards,

Ruben
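
For anyone who wants to check the "upper 48 bits" observation against further traces, here is a minimal sketch in Python. The five values are copied from the messages quoted above; the labels and the helper names `split48_16` and `upper_bits_match` are made up for illustration and do not come from the kernel or from qemu-kvm.

```python
# Suspicious 64-bit values copied from the quoted kernel messages.
# The dictionary keys are descriptive labels only (an assumption of this sketch).
VALUES = {
    "RDI at general protection fault": 0x1603a07305001568,
    "bad spte from previous trace": 0x1603a0730500d277,
    "PTE, corrupted page table": 0x1603a0730500e067,
    "EPT misconfig spte, level 1": 0x1603a07305006277,
    "pte, bad page map": 0x1603a0730500d067,
}


def split48_16(value):
    """Split a 64-bit value into (upper 48 bits, lower 16 bits)."""
    return value >> 16, value & 0xFFFF


def upper_bits_match(values):
    """Return True if every value shares the same upper 48 bits."""
    uppers = {split48_16(v)[0] for v in values}
    return len(uppers) == 1


if __name__ == "__main__":
    for label, value in VALUES.items():
        upper, lower = split48_16(value)
        print(f"{label}: upper48={upper:012x} lower16={lower:04x}")
    print("same upper 48 bits:", upper_bits_match(VALUES.values()))
```

Values collected after a reboot can be added to `VALUES` to see whether the same upper 48 bits show up across boots, which is what the request for more traces is meant to establish.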