[RFC][PATCH] Add a sysctl option controlling kexec when MCE occurred

ebiederm@xxxxxxxxxxxx (Eric W. Biederman) · Sat, 25 Dec 2010 13:40:07 -0800

"H. Peter Anvin" <hpa at zytor.com> writes:

> On 12/25/2010 09:19 AM, Eric W. Biederman wrote:
>>>
>>> So, kdump may receive a wrong identifier when it starts after an MCE
>>> occurred, because MCEs are reported for memory, cache, and TLB errors.
>>>
>>> In the worst case, kdump will overwrite user data if it recognizes a 
>>> disk saving user data as a dump disk.
>> 
>> Absurdly unlikely.  There is a sha256 checksum verified over the
>> kdump kernel before it starts booting.  If you have very broken
>> memory it is possible, but it is absurdly unlikely that the machine
>> will even boot if you are having enough uncorrectable memory errors
>> an hour to get past the sha256 checksum and then be corrupt.
>> 
>
> That wouldn't be the likely scenario (passing a sha256 checksum with the
> wrong data due to a random event will never happen for all the computers
> on Earth before the Sun destroys the planet).  However, in a
> failing-memory scenario, the much more likely scenario is that kdump
> starts up, verifies the signature, and *then* has corruption causing it
> to write to the wrong disk or whatnot.  This is inherent in any scheme
> that allows writing to hard media after a failure (as opposed to, say,
> dumping to the network.)

Then the kdump kernel should also panic if we detect an uncorrected ECC
error.  So this doesn't appear to open any new holes for disk corruption.

kexec on panic can also be used for taking crash dumps over the
network.  What happens with the data is totally defined by userspace
code in an initrd.

Which is why extra policy knobs should live in userspace, where they
can be used.

Eric