From: Aili Yao <yaoaili@xxxxxxxxxxxx> Hi, Thanks for your comments! (Re-transmit,sorry for error mail format) > -----Original Message----- > From: James Morse [mailto:james.morse@xxxxxxx] > Sent: Thursday, October 29, 2020 3:20 AM > To: yaoaili126@xxxxxxx > Cc: rjw@xxxxxxxxxxxxx; lenb@xxxxxxxxxx; tony.luck@xxxxxxxxx; bp@xxxxxxxxx; > linux-acpi@xxxxxxxxxxxxxxx; YANGFENG1 [杨峰] > <YANGFENG1@xxxxxxxxxxxx>; CHENGUOMIN [陈国民] > <CHENGUOMIN@xxxxxxxxxxxx>; yaoaili [么爱利] <yaoaili@xxxxxxxxxxxx> > Subject: Re: [PATCH] Don't do panic for memory Fatal UE in ghes of x86_mce > platform > > Hi! > > On 28/10/2020 07:35, yaoaili126@xxxxxxx wrote: > > From: Aili Yao <yaoaili@xxxxxxxxxxxx> > > > > For x86 with mce, when BIOS get its work done for memory UE,it will > > raise MCE exception, In MCE, it will do panic or recovery work there. > > But When BIOS option WHEA memory log is enabled, > > This is GHES_ASSIST? sorry, I am not sure, What i know is this option for x86 is only to supply more error information. > > > BIOS also prepared one > > detailed error table which will be polled by ghes_notify_nmi from NMI > > watchdog, > > heh, because the NMI notification chain has no idea what triggered it... > > > > ghes_notify_nmi will check the severity and do panic too, this panic > > action is not necessary and confusing, and may lead to unwanted > > results like core dump fail. > > > > Downgrade CPER_SEV_FATAL to GHES_SEV_RECOVERABLE before panic is > > called for x86_mce > > You can't know whether your platform (will?) also generate an MCE when > you build the kernel. Distros set all the Kconfig options, they aren't tuned to > the platform. > > This also makes fatal memory errors recoverable, which isn't true for other > platforms. > I am little confused, if any thing wrong, correct me: I know ACPI and APEI support x86 and ARM, is there any other Platform I don't know? if a Platform doesn't support this, this code will never execute even when the config is on. And for a ARM platfrom Distros, I don't think it's right to have CONFIG_X86_MCE on. > > Assuming this is GHES_ASSIST, I think the simplest approach is to skip the > panic() for CPER records found for those GHES. APEI is only providing > assistance in this case, so its unfair for it to take some terminal action. The > machine-check handler should have the final say in in this case. > > > Section 18.7 of ACPI v6.3a says we're supposed to: > | consume the additional GHES_ASSIST information in the context of an > | error reported by hardware > which would be the MCE - so there is no reason for these GHES entries to be > poked whenever an unrelated NMI occurs. > > Fixing this would need a separate list for the machine check handler to poke > to dump any data from GHES_ASSIST CPER, so its more work than just > skipping the register call. > This is a great idea! This way we may collect more error info and avoid the issue mentioned before. I do realize a prototype following your method,i will try it and test it. > (I've no idea how x86 prioritises MCE and NMI... does one block the other?) >From intel sdm, it says MCE has higher priority than NMI. but the latest kernel is changing fast, NMI may can preemt the MCE for userspace? I am not sure. > > > Thanks, > > James > > > > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c index > > 81bf71b10d44..e5e8a53beb5a 100644 > > --- a/drivers/acpi/apei/ghes.c > > +++ b/drivers/acpi/apei/ghes.c > > @@ -272,7 +272,27 @@ static inline int ghes_severity(int severity) > > case CPER_SEV_RECOVERABLE: > > return GHES_SEV_RECOVERABLE; > > case CPER_SEV_FATAL: > > + { > > +#ifdef CONFIG_X86_MCE > > + int sev, sec_sev; > > + struct acpi_hest_generic_data *gdata; > > + guid_t *sec_type; > > + > > + if (estatus == NULL) > > + return GHES_SEV_PANIC; > > + > > + apei_estatus_for_each_section(estatus, gdata) { > > + sec_type = (guid_t *)gdata->section_type; > > + sec_sev = gdata->error_severity; > > + if (sec_sev == CPER_SEV_FATAL && > > + !guid_equal(sec_type, > &CPER_SEC_PLATFORM_MEM)) > > + return GHES_SEV_PANIC; > > + } > > + return GHES_SEV_RECOVERABLE; > > +#else > > return GHES_SEV_PANIC; > > +#endif > > + } Thanks Aili Yao