Re:Re: [PATCH] Don't do panic for memory Fatal UE in ghes of x86_mce platform

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



From: Aili Yao <yaoaili@xxxxxxxxxxxx>


Hi, Thanks for your comments! (Re-transmit,sorry for error mail format)

> -----Original Message-----
> From: James Morse [mailto:james.morse@xxxxxxx]
> Sent: Thursday, October 29, 2020 3:20 AM
> To: yaoaili126@xxxxxxx
> Cc: rjw@xxxxxxxxxxxxx; lenb@xxxxxxxxxx; tony.luck@xxxxxxxxx; bp@xxxxxxxxx;
> linux-acpi@xxxxxxxxxxxxxxx; YANGFENG1 [杨峰]
> <YANGFENG1@xxxxxxxxxxxx>; CHENGUOMIN [陈国民]
> <CHENGUOMIN@xxxxxxxxxxxx>; yaoaili [么爱利] <yaoaili@xxxxxxxxxxxx>
> Subject: Re: [PATCH] Don't do panic for memory Fatal UE in ghes of x86_mce
> platform
> 
> Hi!
> 
> On 28/10/2020 07:35, yaoaili126@xxxxxxx wrote:
> > From: Aili Yao <yaoaili@xxxxxxxxxxxx>
> >
> > For x86 with mce, when BIOS get its work done for memory UE,it will
> > raise MCE exception, In MCE, it will do panic or recovery work there.
> > But When BIOS option WHEA memory log is enabled,
> 
> This is GHES_ASSIST?

sorry, I am not sure, What i know is this option for x86 is only to supply
more error information.

> 
> > BIOS also prepared one
> > detailed error table which will be polled by ghes_notify_nmi from NMI
> > watchdog,
> 
> heh, because the NMI notification chain has no idea what triggered it...
> 
> 
> > ghes_notify_nmi will check the severity and do panic too, this panic
> > action is not necessary and confusing, and may lead to unwanted
> > results like core dump fail.
> >
> > Downgrade CPER_SEV_FATAL to GHES_SEV_RECOVERABLE before panic is
> > called for x86_mce
> 
> You can't know whether your platform (will?) also generate an MCE when
> you build the kernel. Distros set all the Kconfig options, they aren't tuned to
> the platform.
> 
> This also makes fatal memory errors recoverable, which isn't true for other
> platforms.
> 

I am little confused, if any thing wrong, correct me: I know ACPI and APEI support
x86 and ARM, is there any other Platform I don't know? if a Platform doesn't support 
this, this code will never execute even when the config is on. And for a ARM platfrom 
Distros, I don't think it's right to have CONFIG_X86_MCE on. 

> 
> Assuming this is GHES_ASSIST, I think the simplest approach is to skip the
> panic() for CPER records found for those GHES. APEI is only providing
> assistance in this case, so its unfair for it to take some terminal action. The
> machine-check handler should have the final say in in this case.
> 
> 
> Section 18.7 of ACPI v6.3a says we're supposed to:
> | consume the additional GHES_ASSIST information in the context of an
> | error reported by hardware
> which would be the MCE - so there is no reason for these GHES entries to be
> poked whenever an unrelated NMI occurs.
> 
> Fixing this would need a separate list for the machine check handler to poke
> to dump any data from GHES_ASSIST CPER, so its more work than just
> skipping the register call.
> 

This is a great idea! This way we may collect more error info and avoid the issue mentioned
before. I do realize a prototype following your method,i will try it and test it. 

> (I've no idea how x86 prioritises MCE and NMI... does one block the other?)

>From intel sdm, it says MCE has higher priority than NMI. but the latest kernel is changing
fast, NMI may can preemt the MCE for userspace? I am not sure. 

> 
> 
> Thanks,
> 
> James
> 
> 
> > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c index
> > 81bf71b10d44..e5e8a53beb5a 100644
> > --- a/drivers/acpi/apei/ghes.c
> > +++ b/drivers/acpi/apei/ghes.c
> > @@ -272,7 +272,27 @@ static inline int ghes_severity(int severity)
> >  	case CPER_SEV_RECOVERABLE:
> >  		return GHES_SEV_RECOVERABLE;
> >  	case CPER_SEV_FATAL:
> > +	{
> > +#ifdef CONFIG_X86_MCE
> > +		int sev, sec_sev;
> > +		struct acpi_hest_generic_data *gdata;
> > +		guid_t *sec_type;
> > +
> > +		if (estatus == NULL)
> > +			return GHES_SEV_PANIC;
> > +
> > +		apei_estatus_for_each_section(estatus, gdata) {
> > +			sec_type = (guid_t *)gdata->section_type;
> > +			sec_sev = gdata->error_severity;
> > +			if (sec_sev == CPER_SEV_FATAL &&
> > +			 !guid_equal(sec_type,
> &CPER_SEC_PLATFORM_MEM))
> > +				return GHES_SEV_PANIC;
> > +		}
> > +		return GHES_SEV_RECOVERABLE;
> > +#else
> >  		return GHES_SEV_PANIC;
> > +#endif
> > +	}

Thanks
Aili Yao




[Index of Archives]     [Linux IBM ACPI]     [Linux Power Management]     [Linux Kernel]     [Linux Laptop]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]     [Linux Resources]

  Powered by Linux