Hey Tony,

a "welcome back" is in order? :-)

On Mon, Jan 23, 2017 at 09:40:09AM -0800, Luck, Tony wrote:
> If the system had experienced some memory corruption, but
> recovered ... then there would be some pages sitting around
> that the old kernel had marked as POISON and stopped using.
> The kexec'd kernel doesn't know about these, so may touch that
> memory while taking a crash dump ...

Hmm, pass a list of poisoned pages to the kdump kernel so as not to
touch. Looks like there's already functionality for that:

"makedumpfile can exclude the following types of pages while copying
VMCORE to DUMPFILE, and a user can choose which type of pages will be
excluded.

- Pages filled with zero
- Cache pages
- User process data pages
- Free pages"

(there is a makedumpfile manpage somewhere)

And apparently crash knows about poisoned pages and handles them:

static int __init crash_save_vmcoreinfo_init(void)
{
...
#ifdef CONFIG_MEMORY_FAILURE
	VMCOREINFO_NUMBER(PG_hwpoison);
#endif

so if that works, the kexeced kernel should know about that list.

> and then you have a broadcast machine check (on older[1] Intel CPUs
> that don't support local machine check).

Right.

> This is hard to work around. You really need all the CPUs to have set
> CR4.MCE=1 (if any didn't, then they will force a reset when they see
> the machine check). Also you need to make sure that they jump to the
> copy of do_machine_check() in the new kernel, not the old kernel.

Doesn't matter, right? The new copy is as clueless as the old one about
those MCEs.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.