On 02/16/2017 at 08:22 PM, Borislav Petkov wrote:
> On Thu, Feb 16, 2017 at 07:52:09PM +0800, Xunlei Pang wrote:
>> then mce will be broadcast to the other cpus which are still running
>> in the first kernel(i.e. looping in crash_nmi_callback).
> Simple: the crash code should really mark CPUs as not being online:
>
> void do_machine_check(struct pt_regs *regs, long error_code)
>
> ...
>
>         /* If this CPU is offline, just bail out. */
>         if (cpu_is_offline(smp_processor_id())) {
>                 u64 mcgstatus;
>
>                 mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
>                 if (mcgstatus & MCG_STATUS_RIPV) {
>                         mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
>                         return;
>                 }
>         }
>
> because looping in crash_nmi_callback() does not really denote them as
> CPUs being online.
>
> And just so that you don't disturb the machine too much during crashing,
> you could simply clear them from the online masks, i.e., perhaps call
> remove_cpu_from_maps() with the proper locking around it instead of
> doing a full cpu_down().

That changes the value of cpu_online_mask etc., which will confuse
vmcore analysis.

Moreover, for the code (see the comment inline):

        if (cpu_is_offline(smp_processor_id())) {
                u64 mcgstatus;

                mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
                if (mcgstatus & MCG_STATUS_RIPV) {
                        // This condition may not be true: the mce triggered on the kdump cpu
                        // doesn't need to have this bit set for the other cpus that remain
                        // in the 1st kernel.
                        mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
                        return;
                }
        }

Regards,
Xunlei

>
> The machine will be killed anyway after kdump is done writing out
> memory.
>
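
To make the concern about the RIPV check concrete, here is a minimal user-space sketch, not kernel code. The helper name offline_cpu_should_bail() and its parameters are hypothetical and exist only for illustration; MCG_STATUS_RIPV is taken to be bit 0 of IA32_MCG_STATUS. It models only the decision made by the quoted snippet: an offline CPU returns early only when RIPV is set in the MCG_STATUS value it reads.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of IA32_MCG_STATUS: restart IP valid. */
#define MCG_STATUS_RIPV (1ULL << 0)

/*
 * Hypothetical user-space model of the quoted check (assumption: this only
 * mirrors the two conditions in the snippet, nothing else in
 * do_machine_check()). Returns true when the snippet would clear MCG_STATUS
 * and return early, false when execution would continue past the check.
 */
static bool offline_cpu_should_bail(bool cpu_offline, uint64_t mcgstatus)
{
	if (!cpu_offline)
		return false;

	return (mcgstatus & MCG_STATUS_RIPV) != 0;
}
```

With this model, offline_cpu_should_bail(true, 0) is false: a CPU that is marked offline but reads MCG_STATUS without RIPV set does not take the early return, which is the case the inline comment above is worried about.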