Re: Major KVM issues with kernel 4.5 on the host

Borislav Petkov <bp@xxxxxxxxx> · Sat, 23 Apr 2016 18:04:29 +0200

On Thu, Apr 21, 2016 at 10:04:33PM +0200, Marc Haber wrote:
> Yes, but there are two symptoms. The VM either suffers file system
> issues (garbage read from files, or an aborted ext4 journal and
> following ro remount) or it stops dead in its tracks.

Stops dead? What does that mean exactly? Box is wedged solid and it
doesn't react to any key presses?

Because if so, this could really be DRAM going bad and a correctable
error turning into an uncorrectable one. How old is the DRAM in that
box? Judging by your CPU, it should be a couple of years...

> The longest trigger time I have seen was three hours, I tripled that
> to nine hours, that probably was not enough.

So extend the test time even more, I guess.

> The box reports about one correctable error per week, so I probably
> have a faulty DIMM, but since the issue only surfaces in VMs while the
> host system is in perfect working order...

So it could be that a correctable error turns into an uncorrectable one
at some point. But then you should be getting a machine check
exception...

> And yes, I am pondering to simply replace the box with an Intel CPU.

Your CPU is fine, from what I've seen so far.

> I see "mce: CPU supports 6 MCE banks" once for each reboot, and about
> 30 "Machine check events logged" since January. How do I see which
> events were logged?

Hmm, you have

[   18.149300] MCE: In-kernel MCE decoding enabled.

that's CONFIG_EDAC_DECODE_MCE, so you should have some "Hardware Error"
lines in dmesg, I'd guess, decoding the errors.
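
Those decoded lines carry the kernel's "[Hardware Error]" prefix, so
they can be filtered out of the log. A minimal Python sketch, assuming
dmesg is readable by the calling user; the function names here are made
up for illustration and are not part of any existing tool:

```python
import subprocess

# Prefix the kernel puts in front of decoded hardware error lines.
HW_ERR = "[Hardware Error]"


def find_hardware_errors(log_text):
    """Return the lines of log_text that carry the hardware error prefix."""
    return [line for line in log_text.splitlines() if HW_ERR in line]


def count_mce_notices(log_text):
    """Count the "Machine check events logged" notices in log_text."""
    return sum(
        "Machine check events logged" in line
        for line in log_text.splitlines()
    )


def read_dmesg():
    """Return the kernel ring buffer as text (may need root on some setups)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Example use on a live box:
#   for line in find_hardware_errors(read_dmesg()):
#       print(line)
```

The ring buffer only covers the current boot, so for events since
January the same filter would have to be run over the saved kernel logs
instead.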

> So you basically select the default for new options.

Yap.

> I go the way of Debian packages since it is easier to handle the
> crypto file systems when the machine is booting up.

As long as you're testing the correct bisection kernels...

> And yes, I think about doing a test reinstall on unencrypted disk to
> find out whether encryption plays a role, but I currently need the
> machine too urgently to take it out of service for half a month, and,
> again, the host system is in perfect working order, it is just VMs
> that barf.

Yeah, I can't reproduce it here and I have a very similar box to yours
which is otherwise idle, more or less.

Another fact which points to a DIMM potentially going bad...

> I check the date of the package I am installing and the date stamp of
> the kernels being installed to /boot. I'm reasonably sure I have that
> under control.

Good.
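
That kind of check could also be scripted. A rough Python sketch,
assuming the Debian convention of kernel images named vmlinuz-* in
/boot; list_kernel_images is a hypothetical helper, not an existing
tool:

```python
import glob
import os
import time


def list_kernel_images(boot_dir="/boot", pattern="vmlinuz-*"):
    """Return (path, mtime string) pairs for kernel images, newest first."""
    paths = glob.glob(os.path.join(boot_dir, pattern))
    # Sort by modification time so the most recently installed image is first.
    paths.sort(key=os.path.getmtime, reverse=True)
    return [
        (
            path,
            time.strftime(
                "%Y-%m-%d %H:%M:%S", time.localtime(os.path.getmtime(path))
            ),
        )
        for path in paths
    ]


# Example use:
#   for path, stamp in list_kernel_images():
#       print(stamp, path)
```

The file time stamp only shows when the image was installed; comparing
it against the "uname -v" output of the running kernel is what confirms
that the box actually booted the bisection build.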

> ... and if testing a "good" kernel means a day.

Yeah, it is annoying. In a perfect world, we should all have two
identical boxes so that we use one as a workstation and the second for
testing when the first one, the workstation, barfs. I should bring that
up with my manager next time... :-)

> And whenever 46896c73c1a4 is present, I need to apply Paolo's patch,
> right?

Yap.

Thanks.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.