On Thu, Jun 28, 2018 at 11:07:22AM +0900, gregkh@xxxxxxxxxxxxxxxxxxx wrote:
>
> The patch below does not apply to the 4.4-stable tree.
> If someone wants it applied there, or to any other stable or longterm
> tree, then please email the backport, including the original git commit
> id to <stable@xxxxxxxxxxxxxxx>.
>
> thanks,

This patch relies on:

3acb431b84d8 ("x86/mce: Detect local MCEs properly")

cherry pick that (and fix up the trivial merge problem
around the change to initialize "lmce = 1;" instead
of "lmce = 0";)

Then this will merge cleanly.

-Tony

>
> ------------------ original commit in Linus's tree ------------------
>
> From 40c36e2741d7fe1e66d6ec55477ba5fd19c9c5d2 Mon Sep 17 00:00:00 2001
> From: Tony Luck <tony.luck@xxxxxxxxx>
> Date: Fri, 22 Jun 2018 11:54:23 +0200
> Subject: [PATCH] x86/mce: Fix incorrect "Machine check from unknown source"
>  message
>
> Some injection testing resulted in the following console log:
>
>   mce: [Hardware Error]: CPU 22: Machine Check Exception: f Bank 1: bd80000000100134
>   mce: [Hardware Error]: RIP 10:<ffffffffc05292dd> {pmem_do_bvec+0x11d/0x330 [nd_pmem]}
>   mce: [Hardware Error]: TSC c51a63035d52 ADDR 3234bc4000 MISC 88
>   mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1526502199 SOCKET 0 APIC 38 microcode 2000043
>   mce: [Hardware Error]: Run the above through 'mcelog --ascii'
>   Kernel panic - not syncing: Machine check from unknown source
>
> This confused everybody because the first line quite clearly shows
> that we found a logged error in "Bank 1", while the last line says
> "unknown source".
>
> The problem is that the Linux code doesn't do the right thing
> for a local machine check that results in a fatal error.
>
> It turns out that we know very early in the handler whether the
> machine check is fatal. The call to mce_no_way_out() has checked
> all the banks for the CPU that took the local machine check. If
> it says we must crash, we can do so right away with the right
> messages.
>
> We do scan all the banks again. This means that we might initially
> not see a problem, but during the second scan find something fatal.
> If this happens we print a slightly different message (so I can
> see if it actually every happens).
>
> [ bp: Remove unneeded severity assignment. ]
>
> Signed-off-by: Tony Luck <tony.luck@xxxxxxxxx>
> Signed-off-by: Borislav Petkov <bp@xxxxxxx>
> Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: Ashok Raj <ashok.raj@xxxxxxxxx>
> Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
> Cc: Qiuxu Zhuo <qiuxu.zhuo@xxxxxxxxx>
> Cc: linux-edac <linux-edac@xxxxxxxxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx # 4.2
> Link: http://lkml.kernel.org/r/52e049a497e86fd0b71c529651def8871c804df0.1527283897.git.tony.luck@xxxxxxxxx
>
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 7e6f51a9d917..e93670d736a6 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -1207,13 +1207,18 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  		lmce = m.mcgstatus & MCG_STATUS_LMCES;
>
>  	/*
> +	 * Local machine check may already know that we have to panic.
> +	 * Broadcast machine check begins rendezvous in mce_start()
>  	 * Go through all banks in exclusion of the other CPUs. This way we
>  	 * don't report duplicated events on shared banks because the first one
> -	 * to see it will clear it. If this is a Local MCE, then no need to
> -	 * perform rendezvous.
> +	 * to see it will clear it.
>  	 */
> -	if (!lmce)
> +	if (lmce) {
> +		if (no_way_out)
> +			mce_panic("Fatal local machine check", &m, msg);
> +	} else {
>  		order = mce_start(&no_way_out);
> +	}
>
>  	for (i = 0; i < cfg->banks; i++) {
>  		__clear_bit(i, toclear);
> @@ -1289,12 +1294,17 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  			no_way_out = worst >= MCE_PANIC_SEVERITY;
>  	} else {
>  		/*
> -		 * Local MCE skipped calling mce_reign()
> -		 * If we found a fatal error, we need to panic here.
> +		 * If there was a fatal machine check we should have
> +		 * already called mce_panic earlier in this function.
> +		 * Since we re-read the banks, we might have found
> +		 * something new. Check again to see if we found a
> +		 * fatal error. We call "mce_severity()" again to
> +		 * make sure we have the right "msg".
>  		 */
> -		if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)
> -			mce_panic("Machine check from unknown source",
> -				  NULL, NULL);
> +		if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) {
> +			mce_severity(&m, cfg->tolerant, &msg, true);
> +			mce_panic("Local fatal machine check!", &m, msg);
> +		}
>  	}
>
>  	/*
>
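
For anyone checking a backport by hand, the decision flow that the patch
above introduces for local machine checks can be modelled in user space.
What follows is only a rough sketch under stated assumptions, not kernel
code: struct mc_model, enum mc_outcome, model_machine_check(),
model_panic_message() and MODEL_PANIC_SEVERITY are hypothetical names
invented for illustration, the model returns a label where the real
handler calls mce_panic(), and the broadcast (rendezvous) path is
deliberately left unmodelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Placeholder threshold for this model only; NOT the kernel's value. */
#define MODEL_PANIC_SEVERITY 6

/* Hypothetical labels. The real handler panics instead of returning. */
enum mc_outcome {
	MC_NO_PANIC,             /* local MCE, nothing fatal in either scan */
	MC_EARLY_LOCAL_PANIC,    /* first scan already reported "no way out" */
	MC_LATE_LOCAL_PANIC,     /* only the second scan found a fatal error */
	MC_BROADCAST_RENDEZVOUS  /* not local: outcome is not modelled here */
};

/* Hypothetical stand-ins for state inside do_machine_check(). */
struct mc_model {
	bool lmce;        /* MCG_STATUS_LMCES was set */
	bool no_way_out;  /* verdict of the first bank scan */
	int worst;        /* worst severity seen in the second scan */
	int tolerant;     /* stands in for mca_cfg.tolerant */
};

enum mc_outcome model_machine_check(const struct mc_model *s)
{
	/* Broadcast machine checks go through mce_start(); stop here. */
	if (!s->lmce)
		return MC_BROADCAST_RENDEZVOUS;

	/* Early check: the first scan already knows the error is fatal. */
	if (s->no_way_out)
		return MC_EARLY_LOCAL_PANIC;

	/* Late check: the banks were re-read and something new turned up. */
	if (s->worst >= MODEL_PANIC_SEVERITY && s->tolerant < 3)
		return MC_LATE_LOCAL_PANIC;

	return MC_NO_PANIC;
}

/* Maps a model outcome to the panic string used in the patch, if any. */
const char *model_panic_message(enum mc_outcome o)
{
	switch (o) {
	case MC_EARLY_LOCAL_PANIC:
		return "Fatal local machine check";
	case MC_LATE_LOCAL_PANIC:
		return "Local fatal machine check!";
	default:
		return NULL;
	}
}
```

The two distinct strings mirror the patch: the early panic and the late
panic use different messages, so a console log shows which of the two
checks fired.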