Re: [PATCH 3/3] mce: acpi/apei: trace: Enable ghes memory error trace event

Borislav Petkov <bp@xxxxxxxxx> · Thu, 15 Aug 2013 12:14:32 +0200

On Wed, Aug 14, 2013 at 06:38:09PM +0000, Luck, Tony wrote:
> We've wandered around different strategies here. We definitely
> want the panic log. Some people want all other "kernel exit" logs
> (shutdown, reboot, kexec). When there is enough space in the pstore
> backend we might also want the "oops" that preceeded the panic. (Of
> course when the oops happens we don't know the future, so have to
> save it just in case ... then if more "oops" happen we have to decide
> whether to keep the old oops log, or save the newer one).

Ok, dmesg over serial and *only* oops+panic in pstore. Right.

> Yes - longer logs are better. Sad that the pstore backend devices are
> measured in kilobytes :-)

Right, so good ole serial again to the rescue! There's no room for full
dmesg in nvram because it needs space for the UEFI GUI and some other
crap :-)

> No - write speed for the persistent storage backing pstore (flash)
> means we don't log as we go. We wait for a panic and then our
> registered function gets called so we can snapshot what is in the
> console log at that point. We also don't want to wear out the flash
> which may be soldered to the motherboard.

I suspected as much. So we can forget about using *only* pstore for hw
errors logging. It would be cool to do so but the technology simply
doesn't give it.

> Agreed - we shouldn't clutter logs with details of corrected errors.
> At most we should have a rate-limited log showing the count of
> corrected errors so that someone who just watches dmesg knows they
> should go dig deeper if they see some big number of corrected errors.

/me nods.

> Yes. There are people looking at various "flight recorder" modes for
> tracing that keep logs of normal events in a circular buffer in RAM
> ... if these exist they should be saved at crash time (and they are in
> the kexec/kdump path, but I don’t know if anyone does anything in
> the non-kdump case).

Right, the cheapest solution is serial. Simply log everything to serial
because we can. But this is the key thing I wanted to emphasize:

For severe hardware errors we don't want to use any tracepoint -
actually it is even a bad thing to do so because they would get lost in
some side channels which, during a critical situation, might not get
written to anything/survive the crash, etc.

So what I'm saying is, we basically want severe hardware errors to land
to good old dmesg and to all consoles. No fancy TP stuff for them.

> Tracepoints for errors that are going to lead to system crash would
> only be useful together with a flight recorder to make sure they get
> saved. I think tracepoints for corrected errors are better than dmesg
> logs.

Yes, exactly.

> In a perfect world yes - I don't know that we can achieve perfection
> - but we can iterate through good, better, even better. The really
> hard part of this is figuring out what is *relevant* to save before a
> particular crash happens.

Well, if I have serial connected to the box, it will contain basically
everything the machine said, no?

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html