On 02/20/2017 at 07:09 PM, Borislav Petkov wrote:
> On Mon, Feb 20, 2017 at 02:10:37PM +0800, Xunlei Pang wrote:
>> @@ -1128,8 +1129,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>>  	 */
>>  	int lmce = 1;
>>
>> -	/* If this CPU is offline, just bail out. */
>> -	if (cpu_is_offline(smp_processor_id())) {
>> +	/* If nmi shootdown happened or this CPU is offline, just bail out. */
>> +	if (cpus_shotdown() ||
> I don't like "cpus_shotdown" - it doesn't hint at all that this is
> special-handling crash/kdump.
>
> And more importantly, I want it to be obvious that we do let the
> crashing CPU into the MCE handler.

Ok, I will export crashing_cpu and use it directly in the MCE handler.

>
> Why?
>
> If we didn't, you will not handle *any* MCE, even a fatal one, during
> dumping memory so if that dump is corrupted from the MCE, you won't
> know. And I don't want to be the one staring at the corrupted dump and
> wondering why I'm seeing what I'm seeing.
>
> IOW, if we get a fatal MCE during dumping then we should go and die.
> This is much better than silently corrupting the dump and not even
> saying anything about it.
>

My thought is that it doesn't matter once the kdump kernel has booted,
because a new MCE handler will be installed. If we get a fatal MCE while
kdumping, the new handler will handle it correctly on the CPUs running
the kdump kernel.

There is a small window between the crash and the kdump kernel booting,
so if an SRAO comes within this window it will also cause the MCE
synchronization problem on the crashing CPU if we don't bail out the
crashing CPU.

Regards,
Xunlei
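
For illustration only, the check requested in the review (use crashing_cpu
directly, and still let the crashing CPU into the MCE handler) could be
modeled in plain userspace C roughly as below. This is a sketch under
assumptions, not the patch itself: mce_should_bail_out() and
NO_CRASHING_CPU are hypothetical names, and the convention that -1 means
"no crash in progress" is assumed here rather than taken from this thread.

```c
#include <stdbool.h>

/* Assumed convention for this sketch: -1 means no crash is in progress. */
#define NO_CRASHING_CPU (-1)

/*
 * Hypothetical model of the early-exit test in the #MC handler.
 * Returns true when `cpu` should return early instead of taking part
 * in the MCE rendezvous.
 */
static bool mce_should_bail_out(int cpu, int crashing_cpu, bool cpu_offline)
{
	/* The existing rule: an offline CPU just bails out. */
	if (cpu_offline)
		return true;

	/*
	 * A crash is in progress and this CPU is not the crashing one,
	 * i.e. it was stopped by the NMI shootdown: bail out.
	 */
	if (crashing_cpu != NO_CRASHING_CPU && crashing_cpu != cpu)
		return true;

	/* The crashing CPU (or any CPU when no crash is pending) handles the MCE. */
	return false;
}
```

With this shape, a fatal MCE that arrives while the crashing CPU is writing
the dump is still handled on that CPU, while the CPUs parked by the
shootdown no longer stall the rendezvous.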