Re: [PATCH v2 2/5] x86/mce: dump error msg from severities

Borislav Petkov <bp@xxxxxxxxx> · Sat, 1 Mar 2025 12:10:22 +0100

On Sat, Mar 01, 2025 at 02:16:12PM +0800, Shuai Xue wrote:
> For instance, it does not specify whether the error occurred in the
> context of IN_KERNEL or IN_KERNEL_RECOV, which are crucial for
> understanding the error's circumstances.

1. Crucial for whom? For you? Or for users?

You need to explain how this error message is going to be used. Because simply
issuing such a message causes a lot of panicked people to call a lot of admins
to figure out why their machine is broken. Because they see "mce" and think
"hw broken, need to replace it immediately."

This is one of the reasons we did the cec.c thing (the Correctable Errors
Collector) - just to save people from panicking unnecessarily and causing
expensive and useless maintenance calls.

2. This message goes to dmesg which means something needs to parse it, besides
   a human. An AI?

3. Dmesg is a ring buffer which gets overwritten and this message is
   eventually lost.

There's a reason why MCEs get logged with the notifiers and through
a tracepoint - so that agents can act upon them properly.
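
To make the difference concrete, here is a minimal userspace sketch. It is not
kernel code and it is not how the MCE notifiers or the tracepoint are
implemented. All names in it (ring_log, notifier_chain, err_event, demo) are
hypothetical and exist only for illustration. It assumes a tiny fixed-size
text log that overwrites its oldest slot, next to a callback chain that hands
each event to a registered agent at the time it happens.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define RING_SLOTS   4   /* tiny on purpose, so overwriting is easy to see */
#define MSG_LEN      64
#define MAX_HANDLERS 4

/* Hypothetical stand-in for a fixed-size text log. */
struct ring_log {
	char slot[RING_SLOTS][MSG_LEN];
	unsigned long written;          /* total messages ever written */
};

/* Store msg in the next slot, overwriting the oldest one when full. */
static void ring_write(struct ring_log *r, const char *msg)
{
	assert(r && msg);
	snprintf(r->slot[r->written % RING_SLOTS], MSG_LEN, "%s", msg);
	r->written++;
}

/* Return 1 if msg is still in the buffer, 0 if it has been overwritten. */
static int ring_contains(const struct ring_log *r, const char *msg)
{
	unsigned long live = r->written < RING_SLOTS ? r->written : RING_SLOTS;

	for (unsigned long i = 0; i < live; i++)
		if (strcmp(r->slot[i], msg) == 0)
			return 1;
	return 0;
}

/* Hypothetical structured event: fields an agent can use without parsing. */
struct err_event {
	int bank;
	unsigned long long status;
};

typedef void (*err_handler)(const struct err_event *ev, void *ctx);

/* Hypothetical callback chain: every registered agent sees every event. */
struct notifier_chain {
	err_handler fn[MAX_HANDLERS];
	void *ctx[MAX_HANDLERS];
	int n;
};

static int chain_register(struct notifier_chain *c, err_handler fn, void *ctx)
{
	if (c->n >= MAX_HANDLERS)
		return -1;
	c->fn[c->n] = fn;
	c->ctx[c->n] = ctx;
	c->n++;
	return 0;
}

static void chain_notify(const struct notifier_chain *c,
			 const struct err_event *ev)
{
	for (int i = 0; i < c->n; i++)
		c->fn[i](ev, c->ctx[i]);
}

/* Example agent: counts the events it was handed. */
static void count_events(const struct err_event *ev, void *ctx)
{
	(void)ev;
	(*(int *)ctx)++;
}

/*
 * Report one error both ways, then write `noise` unrelated log lines.
 * Returns 1 if the error text is still in the ring, 0 if it was lost.
 * *agent_count is set to the number of events the agent received.
 */
static int demo(int noise, int *agent_count)
{
	struct ring_log log = {0};
	struct notifier_chain chain = {0};
	struct err_event ev = { .bank = 1, .status = 0xdeadULL };
	char buf[MSG_LEN];

	*agent_count = 0;
	chain_register(&chain, count_events, agent_count);

	ring_write(&log, "mce: example error");
	chain_notify(&chain, &ev);

	for (int i = 0; i < noise; i++) {
		snprintf(buf, sizeof(buf), "unrelated message %d", i);
		ring_write(&log, buf);
	}
	return ring_contains(&log, "mce: example error");
}
```

With four slots, demo(3, &n) still finds the text, while demo(4, &n) does not,
because the fourth unrelated line overwrites the slot holding it. In both
cases the agent has already received the event, so n is 1.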

And we have had this discussion for years now - I'm sorry that you're late to
the party.

> For the regression cases (copy from user) in Patch 3, an error message
> 
>     "mce: Action required: data load in error recoverable area of kernel"

See above.

Besides, this message is completely useless as it has no concrete info about
the error and what is being done about it.

> I could add more explanations in next version if you have no objection.

All of the above are objections.

Please go into git history and read why we're avoiding dumping useless
messages instead of proposing silly patches.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette