RE: [PATCH v2] Dump cper error table in mce_panic

"Luck, Tony" <tony.luck@xxxxxxxxx> · Thu, 28 Jan 2021 17:22:30 +0000

> The even better way to detect this is to be able to check whether this
> is the kdump kernel and whether it got loaded due to a fatal MCE in the
> first kernel and then match that error address with the error address of
> the error which caused the first panic in the mce code. Then the second
> kernel won't need to panic but simply log.

The biggest problem with all of the logging (whether in machine check
banks, or in error records from BIOS) is the lack of a timestamp. If there
were a way to tell if this "just happened", or "happened a while ago" then
such "take action" or "just log" decisions would be simpler.

Maybe you don't need to do *all* those matching checks.  Just a flag
from the first kernel to say "I died from a fatal machine check" could
be used to tell the kdump kernel "just log the CPER stuff".

If the system is broken enough that more machine checks are still
firing in the kdump kernel ... then you would miss trying to recover.
But if more machine checks are happening, then the kdump kernel
is likely doomed anyway.
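
A rough sketch of that flag idea, written as plain C rather than kernel
code. Every name here (struct crash_handover, died_from_fatal_mce,
cper_decide) is made up for illustration; nothing like this exists in the
tree, and how the flag would actually reach the kdump kernel is left open.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: what to do with a CPER record found at boot. */
enum cper_action { CPER_LOG_ONLY, CPER_PANIC };

/* Hypothetical state handed over from the first kernel. */
struct crash_handover {
	bool died_from_fatal_mce;	/* "I died from a fatal machine check" */
};

/*
 * Decide whether a CPER record should be logged or acted on.
 * With no timestamp in the record, the flag stands in for
 * "this error already killed the previous kernel".
 */
static enum cper_action cper_decide(bool is_kdump_kernel,
				    const struct crash_handover *h,
				    bool record_is_fatal)
{
	/* Non-fatal records are only ever logged. */
	if (!record_is_fatal)
		return CPER_LOG_ONLY;

	/* Kdump kernel after a fatal MCE: assume the record is stale. */
	if (is_kdump_kernel && h != NULL && h->died_from_fatal_mce)
		return CPER_LOG_ONLY;

	/* Otherwise treat the fatal record as current. */
	return CPER_PANIC;
}
```

The cost is the one described above: a fatal record that is genuinely new
in the kdump kernel would also be reduced to a log entry.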

Getting a full memory dump after a machine check generally isn't
all that useful anyway. The problem was (almost certainly) h/w, so
not much benefit in decoding the dump to find which code was running
when the h/w signalled.

A second bite at getting the error logs from the death of the first
kernel is worth it though.

-Tony