On Tue, 2013-05-07 at 01:25 +0200, Borislav Petkov wrote: > On Mon, May 06, 2013 at 03:58:22PM -0700, Max Asbock wrote: > > This patch adds code to the machine check handler to parse the > > APEI Hardware Error Source Table (HEST) and check if Firmware First mode > > is set for the Correct Machine Check structure in HEST. > > If Firmware First mode is set, the corrected machine checks are expected > > to be reported through a Generic Hardware Error Source (GHES). Therefore in this > > case CMCI will not be enabled to avoid interference with the firmware's > > error reporting. > > > > Signed-off-by: Max Asbock <masbock@xxxxxxxxxxxxxxxxxx> > > --- > > include/asm/mce.h | 2 + > > kernel/cpu/mcheck/mce-apei.c | 70 +++++++++++++++++++++++++++++++++++++++ > > kernel/cpu/mcheck/mce-internal.h | 5 ++ > > kernel/cpu/mcheck/mce_intel.c | 6 ++- > > 4 files changed, 81 insertions(+), 2 deletions(-) > > > > > > diff -burpN linux-3.9/arch/x86/include/asm/mce.h linux-3.9.ff/arch/x86/include/asm/mce.h > > --- linux-3.9/arch/x86/include/asm/mce.h 2013-05-02 09:38:16.000000000 -0700 > > +++ linux-3.9.ff/arch/x86/include/asm/mce.h 2013-05-03 13:20:36.000000000 -0700 > > @@ -144,12 +144,14 @@ DECLARE_PER_CPU(struct device *, mce_dev > > > > #ifdef CONFIG_X86_MCE_INTEL > > void mce_intel_feature_init(struct cpuinfo_x86 *c); > > +void mce_intel_init_cmci(void); > > void cmci_clear(void); > > void cmci_reenable(void); > > void cmci_rediscover(void); > > void cmci_recheck(void); > > #else > > static inline void mce_intel_feature_init(struct cpuinfo_x86 *c) { } > > +static inline void mce_intel_init_cmci(void) {} > > static inline void cmci_clear(void) {} > > static inline void cmci_reenable(void) {} > > static inline void cmci_rediscover(void) {} > > diff -burpN linux-3.9/arch/x86/kernel/cpu/mcheck/mce-apei.c linux-3.9.ff/arch/x86/kernel/cpu/mcheck/mce-apei.c > > --- linux-3.9/arch/x86/kernel/cpu/mcheck/mce-apei.c 2013-05-02 09:38:16.000000000 -0700 > > +++ linux-3.9.ff/arch/x86/kernel/cpu/mcheck/mce-apei.c 2013-05-06 14:25:48.000000000 -0700 > > @@ -32,6 +32,7 @@ > > #include <linux/kernel.h> > > #include <linux/acpi.h> > > #include <linux/cper.h> > > +#include <linux/init.h> > > #include <acpi/apei.h> > > #include <asm/mce.h> > > > > @@ -148,3 +149,72 @@ int apei_clear_mce(u64 record_id) > > { > > return erst_clear(record_id); > > } > > + > > + > > +/* Support for Firmware First (FF) mode for Corrected Machine Checks > > + * as defined by APEI in the Hardware Error Source Table (HEST). > > + * (see section 18.3.2.2 in the ACPI spec, version 5.0) > > + * > > + * When Firmware First mode is specified in HEST for Corrected Machine Checks, > > + * these errors are expected to be reported by the firmware through a > > + * Generic Harware Event Source (GHES). In this case error reporting > > + * through CMCI should not be enabled so it won't interfere with the firmware. > > + * > > + * In the boot sequence HEST parsing is initialized after > > + * CPUs, therefore initially we don't know if Firmware First is set. > > + * We just assume yes, and enable CMCI in a late init call if FF is not set. > > I'm very sceptical about that assumption - I don't think that the > majority of boxes out there *now* have APEI and FF enabled (and those > working properly :)). > > Besides, there's a bigger problem with this - in the window between > mce_intel_feature_init() and honor_cmc_firmware_first() you don't handle > any correctable errors, AFAICT. You bring up good points here. Most systems probably don't support APEI and FF right now. Therefore a patch should not affect the behavior of the majority of systems. And this current patch does leave a window where CMCI is not enabled for all systems. I actually have a different version of this patch where CMCI is always enabled first and subsequently disabled if a HEST and FF is found, more similar to what you suggest below. > > > + */ > > +static bool cmc_firmware_first; > > +static bool cmc_firmware_first_checked; > > + > > +bool apei_cmc_firmware_first(void) > > +{ > > + if (!cmc_firmware_first_checked) > > + return true; > > + return cmc_firmware_first; > > +} > > + > > +static int check_cmc_firmware_first(struct acpi_hest_header *hest_hdr, void *d) > > +{ > > + if (hest_hdr->type == ACPI_HEST_TYPE_IA32_CORRECTED_CHECK) { > > + struct acpi_hest_ia_corrected *cmc; > > + > > + cmc = (struct acpi_hest_ia_corrected *)hest_hdr; > > + > > + if (cmc->flags & ACPI_HEST_FIRMWARE_FIRST) { > > + cmc_firmware_first = true; > > + return 1; > > + } > > + } > > + return 0; > > +} > > + > > +static void intel_late_init_cmci(void *data) > > +{ > > + if (!mce_available(__this_cpu_ptr(&cpu_info))) > > + return; > > + > > + mce_intel_init_cmci(); > > +} > > + > > +static __init int honor_cmc_firmware_first(void) > > +{ > > + struct cpuinfo_x86 *c = __this_cpu_ptr(&cpu_info); > > + > > + if (c->x86_vendor != X86_VENDOR_INTEL) > > + return 0; > > + > > + apei_hest_parse(check_cmc_firmware_first, NULL); > > + cmc_firmware_first_checked = true; > > + > > + if (cmc_firmware_first) > > + mca_cfg.cmci_disabled = true; > > + > > + if (mca_cfg.cmci_disabled || mca_cfg.ignore_ce) > > + return 0; > > + > > + on_each_cpu(intel_late_init_cmci, NULL, 1); > > + return 0; > > +} > > + > > +late_initcall(honor_cmc_firmware_first); > > This glue code should go to drivers/acpi/apei/ where the rest of the > APEI gunk lives. But this is glue code between MCE and APEI, therefore I thought a file named mce-apei.c would be a good place. If we put it into apei then we might have to export some of the MCE interfaces, whereas apei_hest_parse() is already exported. > > > diff -burpN linux-3.9/arch/x86/kernel/cpu/mcheck/mce_intel.c linux-3.9.ff/arch/x86/kernel/cpu/mcheck/mce_intel.c > > --- linux-3.9/arch/x86/kernel/cpu/mcheck/mce_intel.c 2013-05-02 09:38:16.000000000 -0700 > > +++ linux-3.9.ff/arch/x86/kernel/cpu/mcheck/mce_intel.c 2013-05-03 13:21:53.000000000 -0700 > > @@ -315,7 +315,7 @@ void cmci_reenable(void) > > cmci_discover(banks); > > } > > > > -static void intel_init_cmci(void) > > +void mce_intel_init_cmci(void) > > { > > int banks; > > > > @@ -337,5 +337,7 @@ static void intel_init_cmci(void) > > void mce_intel_feature_init(struct cpuinfo_x86 *c) > > { > > intel_init_thermal(c); > > - intel_init_cmci(); > > + > > + if (!apei_cmc_firmware_first()) > > + mce_intel_init_cmci(); > > Right, so I think the right approach should be to let CMCI get > initialized, then when apei gets to run, it calls cmci_clear() to turn > it off and continue its own FF handling. And this should be doable > without touching any MCA internals like mca_cfg and such. Simply > exporting the CMCI interfaces should be much cleaner. I can post another version of this patch where CMCI gets initialized first. > > Btw, we should keep the possibility open to be able to fall back to CMCI > at any point when, for some reason, it has been determined that APEI > functionality is out of order but we still want to have CMCI polling, > independent from the BIOS. Agreed, it would be good to be able to override FF. thanks, Max -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html