On Fri, Nov 20, 2020 at 05:22:35PM +0800, Aili Yao wrote:
> Hi, This test result is from tip/master, previous is upstream latest.

Thanks for doing those, now let's see.

With rc4 you have the MCE error in the first kernel:

[  106.956286] Disabling lock debugging due to kernel taint
[  106.962373] mce: [Hardware Error]: CPU 18: Machine Check Exception: 5 Bank 7: be00000001010091
[  106.962377] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffffac58472a>
[  106.996488] {acpi_idle_do_entry+0x4a/0x60}
[  107.001057] mce: [Hardware Error]: TSC ae4b410af0b8 ADDR 314d193000 MISC 200400c008002086
[  107.010283] mce: [Hardware Error]: PROCESSOR 0:50657 TIME 1605843017 SOCKET 1 APIC 40 microcode 5000021
[  107.020767] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[  107.031295] mce: [Hardware Error]: Machine check: Processor context corrupt
[  107.039065] Kernel panic - not syncing: Fatal machine check

Now the kdump kernel fires and there's an error record in the CPER thing:

> [    6.280390] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
> [    6.288655] ACPI: Power Button [PWRF]
> [    6.292961] ERST: Error Record Serialization Table (ERST) support is initialized.
> [    6.301295] pstore: Registered erst as persistent store backend
> [    6.307912] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
> [    6.308886] {1}[Hardware Error]: event severity: fatal
> [    6.308886] {1}[Hardware Error]:  Error 0, type: fatal
> [    6.308886] {1}[Hardware Error]:  fru_text: Card03, ChnB, DIMM0
> [    6.308886] {1}[Hardware Error]:   section_type: memory error
> [    6.308886] {1}[Hardware Error]:   error_status: 0x0000000000000000

And this error_status is all clear. I wonder why.

Looking at the UEFI spec, "Section O: Error Status", it defines a couple
of bits there: whether it was an address or control bits error, who
detected the error (responder, requestor), whether it was the first
error, etc, etc.

And none of those bits are set.
Which makes me not trust that error record a whole lot, but that's a
given, since it is firmware and firmware is an unfixable piece of crap
by definition.

So then one could probably say that if none of those error status bits
are set, then the error being reported is not something, let's say,
"fresh". This is doubly the case considering that it gets detected when
the GHES driver probes:

	/* Handle any pending errors right away */
	spin_lock_irqsave(&ghes_notify_lock_irq, flags);
	ghes_proc(ghes);
	spin_unlock_irqrestore(&ghes_notify_lock_irq, flags);

so *maybe*, just *maybe* one could say here: if the error_status doesn't
have any valid bits *and* it has been detected on driver init - i.e.,
the error was there before the driver probed - then even if the error is
fatal, GHES should not call __ghes_panic().

The even better way to detect this would be to check whether this is the
kdump kernel and whether it got loaded due to a fatal MCE in the first
kernel, and then match that error address with the address of the error
which caused the first panic in the mce code. Then the second kernel
won't need to panic but can simply log.

However, I think that second way to check is probably hard and the first
heuristic is probably good enough...

Tony, thoughts?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette