On Wed, May 04, 2011 at 10:39:14PM +0200, Andi Kleen wrote:
> > Any thoughts/suggestions?
>
> My old attempts to solve this are
>
> Don't dump on MCE:
>
> http://git.kernel.org/?p=linux/kernel/git/ak/linux-mce-2.6.git;a=shortlog;h=refs/heads/mce/xpanic
>

The problem we see with avoiding the
panic->crash_kexec->[coredump capture] path is that the user may have no
means of knowing the reason for the crash, unless a serial console is
connected to capture and store the panic string.

Alternatively, a 'slim' kdump (as described here:
https://lkml.org/lkml/2011/5/4/396) would not contain meaningless data
from the old memory, but would still inform the user about the cause of
the crash. I intend to post some patches with a quick implementation of
it soon.

> Handle dumps of corrupted memory regresions:
>
> http://git.kernel.org/?p=linux/kernel/git/ak/linux-mce-2.6.git;a=shortlog;h=refs/heads/mce/crashdump
>
> IMHO these patches are still the right solutions for this.
>

As Vatsa pointed out, the processor's behaviour upon reading (or
performing any I/O operation on) the faulty memory location isn't
clearly defined (to the extent I read through System Programming Guide
Part 1, Volume 3A, Chapter 15). In such a scenario, disabling MCE for
the kdump kernel (which can potentially read the faulty memory) makes
the outcome hazy.

Thanks,
K.Prasad