On Fri, Nov 20, 2020 at 05:22:35PM +0800, Aili Yao wrote:
> Hi, This test result is from tip/master, previous is upstream latest.

Thanks for doing those, now let's see.

With rc4 you have the MCE error in the first kernel:

[  106.956286] Disabling lock debugging due to kernel taint
[  106.962373] mce: [Hardware Error]: CPU 18: Machine Check Exception: 5 Bank 7: be00000001010091
[  106.962377] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffffac58472a>
[  106.996488] {acpi_idle_do_entry+0x4a/0x60}
[  107.001057] mce: [Hardware Error]: TSC ae4b410af0b8 ADDR 314d193000 MISC 200400c008002086
[  107.010283] mce: [Hardware Error]: PROCESSOR 0:50657 TIME 1605843017 SOCKET 1 APIC 40 microcode 5000021
[  107.020767] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[  107.031295] mce: [Hardware Error]: Machine check: Processor context corrupt
[  107.039065] Kernel panic - not syncing: Fatal machine check

Now the kdump kernel fires and there's an error record in the CPER thing:

> [    6.280390] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
> [    6.288655] ACPI: Power Button [PWRF]
> [    6.292961] ERST: Error Record Serialization Table (ERST) support is initialized.
> [    6.301295] pstore: Registered erst as persistent store backend
> [    6.307912] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
> [    6.308886] {1}[Hardware Error]: event severity: fatal
> [    6.308886] {1}[Hardware Error]:  Error 0, type: fatal
> [    6.308886] {1}[Hardware Error]:  fru_text: Card03, ChnB, DIMM0
> [    6.308886] {1}[Hardware Error]:   section_type: memory error
> [    6.308886] {1}[Hardware Error]:   error_status: 0x0000000000000000

And this error_status is all clear. I wonder why.

Looking at the UEFI spec, "Section O: Error Status", it defines a couple
of bits there: whether it was an address or control bits error, who
detected the error (responder, requestor), whether it was the first
error, etc, etc.

And none of those bits are set.
Which makes me not trust that error record a whole lot, but that's a
given, since it is firmware and firmware is an unfixable piece of crap
by definition.

So then one could probably say that if none of those error status bits
are set, then the error being reported is not something, let's say,
"fresh". This is doubly the case considering that it gets detected when
the GHES driver probes:

	/* Handle any pending errors right away */
	spin_lock_irqsave(&ghes_notify_lock_irq, flags);
	ghes_proc(ghes);
	spin_unlock_irqrestore(&ghes_notify_lock_irq, flags);

so *maybe*, just *maybe* one could say here: if the error_status doesn't
have any valid bits *and* it has been detected on driver init - i.e.,
the error was there before the driver probed - then even if the error is
fatal, GHES should not call __ghes_panic().

The even better way to detect this would be to check whether this is the
kdump kernel and whether it got loaded due to a fatal MCE in the first
kernel, and then match that error address with the address of the error
which caused the first panic in the mce code. Then the second kernel
won't need to panic but can simply log.

However, I think that second way to check is probably hard and the first
heuristic is probably good enough...

Tony, thoughts?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette