[Bug] Kdump does not work when panic triggered due to MCE

prasad@xxxxxxxxxxxxxxxxxx (K.Prasad) · Mon, 9 May 2011 22:33:36 +0530

On Mon, May 09, 2011 at 05:21:06PM +0200, Bouchard Louis wrote:
> Hello,
> 
> On 09/05/2011 14:39, Vivek Goyal wrote:
> >
> > Prasad,
> >
> > I have never tried taking a dump in an MCE situation. Does kdump work on
> > this machine with a normal panic()?
> >
> > Use the --debug and --serial options in kexec-tools to print some debug
> > messages and look for "I am in purgatory". This will tell you whether you
> > hung in the first kernel or the second kernel.
> >
> > Then put "outb()" messages in the kernel to trace what happened. 
> >
> > Thanks
> > Vivek
> >
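
The check described above (look for "I am in purgatory" in the debug output) can be sketched as a small script run over a captured serial console log. This is only an illustration: the function name and the sample log lines are hypothetical, and the only string taken from the thread is the purgatory marker itself.

```python
# Minimal sketch: decide where the crash path stalled from a captured
# serial console log. Only the marker string comes from the thread;
# the function name and return labels are illustrative assumptions.
PURGATORY_MARKER = "I am in purgatory"


def locate_hang(serial_log: str) -> str:
    """Return a coarse guess at where the hang occurred."""
    if PURGATORY_MARKER in serial_log:
        # Control left the crashed kernel, so the hang is later in the path.
        return "purgatory or second kernel"
    # The marker never appeared, so purgatory was never reached.
    return "first kernel"


if __name__ == "__main__":
    # Hypothetical captured output, for illustration only.
    sample = "Kernel panic - not syncing: Fatal Machine check\n"
    print(locate_hang(sample))
```

If the marker is present, the outb() tracing belongs in the second kernel's boot path; if not, in the first kernel's panic/crash path.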
> I have seen numerous occurrences of MCE-triggered kernel panics in both
> RHEL and SLES environments on the IA32 architecture, both in contexts
> where kexec/kdump was being used.
>

That's interesting! Assuming that these are not software-induced MCEs
but panic() calls invoked due to unrecoverable memory errors in a
physical machine, did you experience any situation where the kdump
kernel hung or rebooted due to a second MCE (triggered while reading the
faulty memory location belonging to the first kernel)?

> As a matter of fact, MCE-triggered panics are part of the reason that
> pushed me to work on crashdc: only one crash command is required to get
> the MCE trace out of the kernel ring buffer. This avoids transferring
> massive vmcore files over the net.
> 

What is the data that is contained in the faulty memory location (whose
I/O triggered an MCE in the first place)? Basically we'd like to
understand what a 'read' operation on the corrupted memory location
would result in.

> crashdc does well on those, mcelog can be applied on the data gathered.
>

We're contemplating a solution along similar lines (refer to the
description of 'slim' kdump at https://lkml.org/lkml/2011/5/4/396) to
create a coredump readable by the 'crash' tool that contains a message
indicating the cause of the crash as MCE (and not any data from the old
memory).
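
As a rough illustration of keeping only the MCE-related messages rather than the contents of the old memory, the filtering step could look like the sketch below. It is not taken from crashdc or from the 'slim' kdump proposal; the marker strings, the function name and the sample ring buffer text are assumptions made for the example.

```python
# Minimal sketch: pull machine-check related lines out of kernel ring
# buffer text that has already been saved as a string.
# The marker list is an assumption; real logs may need other patterns.
MCE_MARKERS = ("machine check", "[hardware error]")


def extract_mce_lines(ring_buffer_text: str) -> list[str]:
    """Return only the lines that look machine-check related."""
    return [
        line
        for line in ring_buffer_text.splitlines()
        if any(marker in line.lower() for marker in MCE_MARKERS)
    ]


if __name__ == "__main__":
    # Hypothetical ring buffer excerpt, for illustration only.
    sample = (
        "eth0: link up\n"
        "[Hardware Error]: CPU 0: Machine Check Exception\n"
        "Kernel panic - not syncing: Fatal Machine check\n"
    )
    for line in extract_mce_lines(sample):
        print(line)
```

The output of such a filter is small enough to send over the network and can then be handed to mcelog for decoding.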

I'll take a look at the crashdc code and see if there are ideas that we
can borrow from there.

Thanks,
K.Prasad