On 02/20/2017 at 09:29 PM, Xunlei Pang wrote:
> On 02/20/2017 at 07:09 PM, Borislav Petkov wrote:
>> On Mon, Feb 20, 2017 at 02:10:37PM +0800, Xunlei Pang wrote:
>>> @@ -1128,8 +1129,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>>>  	 */
>>>  	int lmce = 1;
>>>
>>> -	/* If this CPU is offline, just bail out. */
>>> -	if (cpu_is_offline(smp_processor_id())) {
>>> +	/* If nmi shootdown happened or this CPU is offline, just bail out. */
>>> +	if (cpus_shotdown() ||
>> I don't like "cpus_shotdown" - it doesn't hint at all that this is
>> special-handling crash/kdump.
>>
>> And more importantly, I want it to be obvious that we do let the
>> crashing CPU into the MCE handler.
> Ok, I will export crashing_cpu and use it directly in mce handler.

Forgot to mention: one reason I introduced cpus_shotdown() is that
"crashing_cpu" is only defined with CONFIG_SMP=y, so we have to export it
unconditionally if we don't want to add conditional code (i.e. code wrapped
in #ifdef CONFIG_SMP) in mce.c.

Regards,
Xunlei

>
>> Why?
>>
>> If we didn't, you will not handle *any* MCE, even a fatal one, during
>> dumping memory so if that dump is corrupted from the MCE, you won't
>> know. And I don't want to be the one staring at the corrupted dump and
>> wondering why I'm seeing what I'm seeing.
>>
>> IOW, if we get a fatal MCE during dumping then we should go and die.
>> This is much better than silently corrupting the dump and not even
>> saying anything about it.
>>
> My thought is that it doesn't matter after kdump boots as new mce handler
> will be installed. If we get a fatal MCE during kdumping, the new handler will
> handle the cpus running kdump kernel correctly.
>
> There is a small window between crash and kdump kernel boot, so if a SRAO comes
> within this window it will also cause the mce synchronization problem on the crashing
> cpu if we don't bail out the crashing cpu.
>
> Regards,
> Xunlei