Re: [RFC PATCH v2 0/9] Use ERST for persistent storage of MCE and APEI errors

Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx> · Thu, 26 Oct 2023 18:21:19 +0800

On 2023/10/7 15:15, Shuai Xue wrote:
> 
> 
> On 2023/9/28 22:43, Borislav Petkov wrote:
>> On Mon, Sep 25, 2023 at 03:44:17PM +0800, Shuai Xue wrote:
>>> After /dev/mcelog character device deprecated by commit 5de97c9f6d85
>>> ("x86/mce: Factor out and deprecate the /dev/mcelog driver"), the
>>> serialized MCE error record, of previous boot in persistent storage is not
>>> collected via APEI ERST.
>>
>> You lost me here. /dev/mcelog is deprecated but you can still use it and
>> apei_write_mce() still happens.
> 
> Yes, you are right. apei_write_mce() still happens so that MCE records are
> written to persistent storage and the MCE records can be retrieved by
> apei_read_mce(). Previously, the task was performed by the mcelog package.
> However, it has been deprecated, some distributions like Arch kernels are
> not even compiled with the necessary configuration option
> CONFIG_X86_MCELOG_LEGACY.[1]
> 
> So, IMHO, it's better to add a way to retrieve MCE records through switching
> to the new generation rasdaemon solution.
> 
>>
>> Looking at your patches, you're adding this to ghes so how about you sit
>> down first and explain your exact use case and what exactly you wanna
>> do?
>>
>> Thx.
>>
> 
> Sorry for the poor cover letter. I hope the following response can clarify
> the matter.
> 
> Q1: What is the exact problem?
> 
> Traditionally, fatal hardware errors will cause Linux print error log to
> console, e.g. print_mce() or __ghes_print_estatus(), then reboot. With
> Linux, the primary method for obtaining debugging information of a serious
> error or fault is via the kdump mechanism. Kdump captures a wealth of
> kernel and machine state and writes it to a file for post-mortem debugging.
> 
> In certain scenarios, ie. hosts/guests with root filesystems on NFS/iSCSI
> where networking software and/or hardware fails, and thus kdump fails to
> collect the hardware error context, leaving us unaware of what actually
> occurred. In the public cloud scenario, multiple virtual machines run on a
> single physical server, and if that server experiences a failure, it can
> potentially impact multiple tenants. It is crucial for us to thoroughly
> analyze the root causes of each instance failure in order to:
> 
> - Provide customers with a detailed explanation of the outage to reassure them.
> - Collect the characteristics of the failures, such as ECC syndrome, to enable fault prediction.
> - Explore potential solutions to prevent widespread outages.
> 
> In short, it is necessary to serialize hardware error information available
> for post-mortem debugging.
> 
> Q2: What exactly I wanna do:
> 
> The MCE handler, do_machine_check(), saves the MCE record to persistent
> storage and it is retrieved by mcelog. Mcelog has been deprecated when
> kernel 4.12 released in 2017, and the help of the configuration option
> CONFIG_X86_MCELOG_LEGACY suggest to consider switching to the new
> generation rasdaemon solution. The GHES handler does not support APEI error
> record now.
> 
> To serialize hardware error information available for post-mortem
> debugging:
> - add support to save APEI error record into flash via ERST before go panic,
> - add support to retrieve MCE or APEI error record from the flash and emit
> the related tracepoint after system boot successful again so that rasdaemon
> can collect them
> 
> 
> Best Regards,
> Shuai
> 
> 
> [1] https://wiki.archlinux.org/title/Machine-check_exception

Hi, Borislav,

I would like to inquire about your satisfaction with the motivation
provided. If you have no objections, I am prepared to address Kees's
comments, update the cover letter, and proceed with sending a new version.

Thank you.

Best Regards,
Shuai