On 2023/10/7 15:15, Shuai Xue wrote:
>
>
> On 2023/9/28 22:43, Borislav Petkov wrote:
>> On Mon, Sep 25, 2023 at 03:44:17PM +0800, Shuai Xue wrote:
>>> After /dev/mcelog character device deprecated by commit 5de97c9f6d85
>>> ("x86/mce: Factor out and deprecate the /dev/mcelog driver"), the
>>> serialized MCE error record, of previous boot in persistent storage is not
>>> collected via APEI ERST.
>>
>> You lost me here. /dev/mcelog is deprecated but you can still use it and
>> apei_write_mce() still happens.
>
> Yes, you are right. apei_write_mce() still happens, so MCE records are
> written to persistent storage and can be retrieved by apei_read_mce().
> Previously, that retrieval was performed by the mcelog package. However,
> it has been deprecated, and some distribution kernels, such as Arch's, are
> not even compiled with the necessary configuration option
> CONFIG_X86_MCELOG_LEGACY.[1]
>
> So, IMHO, it's better to add a way to retrieve MCE records by switching
> to the new generation rasdaemon solution.
>
>>
>> Looking at your patches, you're adding this to ghes so how about you sit
>> down first and explain your exact use case and what exactly you wanna
>> do?
>>
>> Thx.
>>
>
> Sorry for the poor cover letter. I hope the following response can clarify
> the matter.
>
> Q1: What is the exact problem?
>
> Traditionally, a fatal hardware error causes Linux to print an error log to
> the console, e.g. via print_mce() or __ghes_print_estatus(), and then
> reboot. With Linux, the primary method for obtaining debugging information
> about a serious error or fault is the kdump mechanism. Kdump captures a
> wealth of kernel and machine state and writes it to a file for post-mortem
> debugging.
>
> In certain scenarios, e.g. hosts/guests with root filesystems on NFS/iSCSI
> where networking software and/or hardware fails, kdump fails to collect
> the hardware error context, leaving us unaware of what actually occurred.
> In the public cloud scenario, multiple virtual machines run on a single
> physical server, and if that server experiences a failure, it can
> potentially impact multiple tenants. It is crucial for us to thoroughly
> analyze the root causes of each instance failure in order to:
>
> - Provide customers with a detailed explanation of the outage to reassure
>   them.
> - Collect the characteristics of the failures, such as ECC syndrome, to
>   enable fault prediction.
> - Explore potential solutions to prevent widespread outages.
>
> In short, it is necessary to serialize hardware error information so that
> it is available for post-mortem debugging.
>
> Q2: What exactly do I want to do?
>
> The MCE handler, do_machine_check(), saves the MCE record to persistent
> storage, and the record is retrieved by mcelog. Mcelog was deprecated when
> kernel 4.12 was released in 2017, and the help text of the configuration
> option CONFIG_X86_MCELOG_LEGACY suggests considering a switch to the new
> generation rasdaemon solution. The GHES handler currently has no support
> for APEI error records.
>
> To serialize hardware error information so that it is available for
> post-mortem debugging:
> - add support for saving the APEI error record into flash via ERST before
>   panicking,
> - add support for retrieving the MCE or APEI error record from flash and
>   emitting the related tracepoint after the system boots successfully
>   again, so that rasdaemon can collect it.
>
>
> Best Regards,
> Shuai
>
>
> [1] https://wiki.archlinux.org/title/Machine-check_exception

Hi, Borislav,

I would like to ask whether the motivation above addresses your concerns.
If you have no objections, I will address Kees's comments, update the cover
letter, and send a new version.

Thank you.

Best Regards,
Shuai