> On Fri, Nov 20, 2020 at 05:22:35PM +0800, Aili Yao wrote:
> > Hi, This test result is from tip/master, previous is upstream latest.
>
> Thanks for doing those, now let's see.
>
> With rc4 you have the MCE error in the first kernel:
>
> [  106.956286] Disabling lock debugging due to kernel taint
> [  106.962373] mce: [Hardware Error]: CPU 18: Machine Check Exception: 5 Bank 7: be00000001010091
> [  106.962377] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffffac58472a> {acpi_idle_do_entry+0x4a/0x60}
> [  107.001057] mce: [Hardware Error]: TSC ae4b410af0b8 ADDR 314d193000 MISC 200400c008002086
> [  107.010283] mce: [Hardware Error]: PROCESSOR 0:50657 TIME 1605843017 SOCKET 1 APIC 40 microcode 5000021
> [  107.020767] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> [  107.031295] mce: [Hardware Error]: Machine check: Processor context corrupt
> [  107.039065] Kernel panic - not syncing: Fatal machine check
>
> Now the kdump kernel fires and there's an error record in the CPER
> thing.
>
> > [    6.280390] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
> > [    6.288655] ACPI: Power Button [PWRF]
> > [    6.292961] ERST: Error Record Serialization Table (ERST) support is initialized.
> > [    6.301295] pstore: Registered erst as persistent store backend
> > [    6.307912] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
> > [    6.308886] {1}[Hardware Error]: event severity: fatal
> > [    6.308886] {1}[Hardware Error]:  Error 0, type: fatal
> > [    6.308886] {1}[Hardware Error]:   fru_text: Card03, ChnB, DIMM0
> > [    6.308886] {1}[Hardware Error]:   section_type: memory error
> > [    6.308886] {1}[Hardware Error]:   error_status: 0x0000000000000000
>
> And this error_status is all clear. I wonder why.
> Looking at the UEFI spec "Section O: Error Status", it defines a couple
> of bits there: whether it was an address or control bits error, who
> detected the error (responder, requestor), whether it was the first
> error, etc, etc.
>
> And none of those bits are set.
>
> Which makes me not trust that error record a whole lot, but that's a
> given, since it is firmware and firmware is an unfixable piece of crap
> by definition.
>
> So then one could probably say that if none of those error status bits
> are set, then the error being reported is not something, let's say,
> "fresh". This is doubly the case considering that it gets detected when
> the GHES driver probes:
>
>	/* Handle any pending errors right away */
>	spin_lock_irqsave(&ghes_notify_lock_irq, flags);
>	ghes_proc(ghes);
>	spin_unlock_irqrestore(&ghes_notify_lock_irq, flags);
>
> so *maybe*, just *maybe* one could say here:
>
> If the error_status doesn't have any valid bits *and* it has been
> detected on driver init - i.e., the error has been there before the
> driver probed - then even if the error is fatal, GHES should not call
> __ghes_panic().
>
> The even better way to detect this would be to check whether this is
> the kdump kernel and whether it got loaded due to a fatal MCE in the
> first kernel, and then match that error address with the error address
> of the error which caused the first panic in the mce code. Then the
> second kernel wouldn't need to panic but could simply log.
>
> However, I think that second way to check is probably hard and the first
> heuristic is probably good enough...
>
> Tony, thoughts?

It has been a while since this issue was discussed - any feedback?

On the kexec-tools side, the hest_disable parameter has been added to the
second kernel's command line, so kdump will not be affected by GHES errors.
But we may still lose the GHES error info, so I think this patch is still
needed?

Thanks

-- 
Best Regards!
Aili Yao