Hi, Vishal,

Vishal Verma <vishal.l.verma@xxxxxxxxx> writes:

> The mce handler for 'nfit' devices is called for memory errors on a
> Non-Volatile DIMM, and adds the error location to a 'badblocks' list.
> This list is used by the various NVDIMM drivers to avoid consuming known
> poison locations during IO.
>
> The mce handler gets called for both corrected and uncorrectable errors.

Sorry for necroposting. I thought the point of the CEC was to make sure
that the other registered decoders only ever saw uncorrected errors.
How do we end up getting called with a correctable error?

Thanks,
Jeff

> Until now, both kinds of errors have been added to the badblocks list.
> However, corrected memory errors indicate that the problem has already
> been fixed by hardware, and the resulting interrupt is merely a
> notification to Linux. As far as future accesses to that location are
> concerned, it is perfectly fine to use, and thus doesn't need to be
> included in the above badblocks list.
>
> Add a check in the nfit mce handler to filter out corrected mce events,
> and only process uncorrectable errors.
>
> Reported-by: Omar Avelar <omar.avelar@xxxxxxxxx>
> Fixes: 6839a6d96f4e ("nfit: do an ARS scrub on hitting a latent media error")
> Cc: stable@xxxxxxxxxxxxxxx
> Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
> Cc: Tony Luck <tony.luck@xxxxxxxxx>
> Cc: Borislav Petkov <bp@xxxxxxxxx>
> Signed-off-by: Vishal Verma <vishal.l.verma@xxxxxxxxx>
> ---
>  arch/x86/include/asm/mce.h       | 1 +
>  arch/x86/kernel/cpu/mcheck/mce.c | 3 ++-
>  drivers/acpi/nfit/mce.c          | 4 ++--
>  3 files changed, 5 insertions(+), 3 deletions(-)
>
> v3: Unchanged from v2
>
> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> index 3a17107594c8..3111b3cee2ee 100644
> --- a/arch/x86/include/asm/mce.h
> +++ b/arch/x86/include/asm/mce.h
> @@ -216,6 +216,7 @@ static inline int umc_normaddr_to_sysaddr(u64 norm_addr, u16 nid, u8 umc, u64 *s
>
>  int mce_available(struct cpuinfo_x86 *c);
>  bool mce_is_memory_error(struct mce *m);
> +bool mce_is_correctable(struct mce *m);
>
>  DECLARE_PER_CPU(unsigned, mce_exception_count);
>  DECLARE_PER_CPU(unsigned, mce_poll_count);
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 953b3ce92dcc..27015948bc41 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -534,7 +534,7 @@ bool mce_is_memory_error(struct mce *m)
>  }
>  EXPORT_SYMBOL_GPL(mce_is_memory_error);
>
> -static bool mce_is_correctable(struct mce *m)
> +bool mce_is_correctable(struct mce *m)
>  {
>  	if (m->cpuvendor == X86_VENDOR_AMD && m->status & MCI_STATUS_DEFERRED)
>  		return false;
> @@ -544,6 +544,7 @@ static bool mce_is_correctable(struct mce *m)
>
>  	return true;
>  }
> +EXPORT_SYMBOL_GPL(mce_is_correctable);
>
>  static bool cec_add_mce(struct mce *m)
>  {
> diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
> index e9626bf6ca29..7a51707f87e9 100644
> --- a/drivers/acpi/nfit/mce.c
> +++ b/drivers/acpi/nfit/mce.c
> @@ -25,8 +25,8 @@ static int nfit_handle_mce(struct notifier_block *nb, unsigned long val,
>  	struct acpi_nfit_desc *acpi_desc;
>  	struct nfit_spa *nfit_spa;
>
> -	/* We only care about memory errors */
> -	if (!mce_is_memory_error(mce))
> +	/* We only care about uncorrectable memory errors */
> +	if (!mce_is_memory_error(mce) || mce_is_correctable(mce))
>  		return NOTIFY_DONE;
>
>  	/*