Re: [PATCH] Dump cper error table in mce_panic

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello!

On 04/11/2020 06:50, yaoaili126@xxxxxxx wrote:
> From: Aili Yao <yaoaili@xxxxxxxxxxxx>
> 
> For X86_MCE, When there is a fatal ue error, BIOS will prepare one
> detailed cper error table before raising MCE,

(outside GHES-ASSIST), Its not supposed to do this.

There is an example flow described in 18.4.1 "Example: Firmware First Handling Using NMI
Notification" of ACPI v6.3:
https://uefi.org/sites/default/files/resources/ACPI_Spec_6_3_A_Oct_6_2020.pdf


The machine-check is the notification from hardware, which in step 1 of the above should
go to firmware. You should only see an NMI, which is step 8.
Step 7 is to clear the error from hardware, so triggering a machine-check is pointless.
(but I agree no firmware ever follows this!)


You appear to have something that behaves as GHES-ASSIST. Can you post the decompiled dump
of your HEST table? (decompiled, no binaries!) If its large, you can post it to me off
list and I'll copy the relevant bits here...


> this cper table is meant
> to supply addtional error information and not to race with mce handler
> to panic.

This is a description of GHES_ASSIST. See 18.7 "GHES_ASSIST Error Reporting" of the above pdf.


> Usually possible unexpected cper process from NMI watchdog race panic
> with MCE panic is not a problem, the panic process will coordinate with
> each core. But When the CPER is not processed in the first kernel and
> leave it to the second kernel, this is a problem, lead to a kdump fail.

> Now in this patch, the mce_panic will race with unexpected NMI to dump
> the cper error log and get it cleaned, this will prevent the cper table
> leak to the second kernel, which will fix the kdump fail problem, and
> also guarrante the cper log is collected which it's meant to.

> Anyway,For x86_mce platform, the ghes module is still needed not to
> panic for fatal memory UE as it's MCE handler's work.

If and only if those GHES are marked as GHES_ASSIST.

If they are not, then you have a fully fledged firwmare-first system.

Could you share what your system is describing it as in the HEST so we can work out what
is going on here?!

We need to work this out first.


Thanks,

James




[Index of Archives]     [Linux IBM ACPI]     [Linux Power Management]     [Linux Kernel]     [Linux Laptop]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]     [Linux Resources]

  Powered by Linux