Re: [PATCH v2 2/5] x86/mce: dump error msg from severities

Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx> · Sun, 2 Mar 2025 15:14:52 +0800

On 2025/3/2 02:47, Borislav Petkov wrote:
On Sat, Mar 01, 2025 at 10:03:13PM +0800, Shuai Xue wrote:
(By the way, CentOS/Red Hat build their kernels without CONFIG_RAS_CEC set,
because it breaks EDAC decoding. We do not use CEC in production at all for
the same reason.)

It doesn't "break" error decoding - it collects every correctable DRAM error
and puts it in a "leaky" bucket of sorts. And when a certain error address
generates too many errors, it memory_failure()s the page and poisons it.
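
The leaky-bucket behaviour described above can be sketched in a few lines of
Python. This is only an illustration, not the kernel's CEC code: the class
name, the threshold of 3 and the halving decay policy are all assumptions.

```python
from collections import defaultdict


class LeakyBucket:
    """Toy per-page correctable-error counter (hypothetical, not the kernel CEC)."""

    def __init__(self, threshold=3):
        self.threshold = threshold      # errors per page before acting (assumed value)
        self.counts = defaultdict(int)  # page frame number -> current error count
        self.offlined = []              # pages that would be handed to memory_failure()

    def report(self, pfn):
        """Record one correctable error; return True if the page hit the threshold."""
        self.counts[pfn] += 1
        if self.counts[pfn] >= self.threshold:
            del self.counts[pfn]
            self.offlined.append(pfn)
            return True
        return False

    def decay(self):
        """The 'leak': halve every counter and forget pages that reach zero."""
        for pfn in list(self.counts):
            self.counts[pfn] //= 2
            if self.counts[pfn] == 0:
                del self.counts[pfn]
```

A page that errors rarely never reaches the threshold because decay() keeps
draining its counter; only an address that keeps generating errors gets
poisoned.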

You do not use it in production because you want to see every error, collect
it, massage it and perhaps decide when DIMMs go bad and you can replace
them... or whatever you do.

All the others who enable it, and we, can sleep properly, without getting
unnecessarily upset about a correctable error.

Yes, we want to see every CE error and use the CE pattern (e.g. correctable
error-bit)[1][2] to predict whether a row fault is prone to UEs or not.
And we are not upset by CE errors, because they have been corrected by
hardware :)

[1] https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fault-aware-prediction-guide.pdf
[2] https://arxiv.org/html/2312.02855v2

Yes, we collect all kernel messages from the hosts, parse the logs and predict
panics with AI tools. The more details we collect, the better the performance
of the AI model.

LOL.

We went to the great effort of adding an MCE tracepoint which gives a
*structured* error record, showed an example of how to use it in rasdaemon,
and you go and do the crazy hard and, at the same time, silly thing and parse
dmesg?!??!

This is priceless. Oh boy.

Agreed, the tracepoint is a more elegant way. However, it does not include the
error context, just some hardware registers.
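
To make "just some hardware registers" concrete, here is a minimal Python
sketch of what can be derived from the status register alone. It assumes the
Intel IA32_MCi_STATUS bit layout and a CPU with software error recovery
support; the function names and the example value are hypothetical, and other
vendors use some of these bits differently.

```python
# Flag bit positions in Intel's IA32_MCi_STATUS (assumed layout, Intel only).
FLAGS = {
    "VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
    "MISCV": 59, "ADDRV": 58, "PCC": 57, "S": 56, "AR": 55,
}


def decode_status(status):
    """Split a raw 64-bit status value into its set flags and the MCA error code."""
    flags = {name for name, bit in FLAGS.items() if (status >> bit) & 1}
    return {"flags": flags, "mcacod": status & 0xFFFF}


def classify(status):
    """Coarse class of the error, derived from the register bits only."""
    flags = decode_status(status)["flags"]
    if "VAL" not in flags:
        return "invalid"
    if "UC" not in flags:
        return "corrected"
    if "PCC" in flags:
        return "fatal (processor context corrupt)"
    if "S" in flags and "AR" in flags:
        return "action required"
    if "S" in flags:
        return "action optional"
    return "uncorrected, no action required"


# Hypothetical example value: an uncorrected, signaled, action-required data load.
print(classify(0xBD80000000000134))  # action required
```

The register tells you the class of the error, but not where the kernel was
executing when it consumed the poison, which is the context I am after.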

The error context is in the behavior of the hw. If the error is fatal, you
won't see it - the machine will panic or do something else to prevent error
propagation. It definitely won't run any software anymore.

If you see the error getting logged, it means it is not fatal enough to kill
the machine.

Agreed.

Besides, this message is completely useless as it has no concrete info about
the error and what is being done about it.

I don't think so,

I think so and you're not reading my mail.

     "mce: Uncorrected hardware memory error in user-access at 3b116c400"

It is the current message in kill_me_maybe(), not added by me.

Ask yourself: what can you do when you see a message like that?

Exactly *nothing* because there's not nearly enough information to recover
from it or log it or whatever. That error message is *totally useless* and
you're upsetting your users unnecessarily and even if they report it to you,
you can't help them.

I believe we are approaching this issue from different perspectives.
As a cloud service provider, I need to address the following points:

1. I must be able to explain to end users why the MCE has occurred.
2. It is important to determine whether there are any kernel bugs that could
   compromise the overall stability of the cloud platform.
3. We need to identify and implement potential improvements.

"mce: Uncorrected hardware memory error in user-access at 3b116c400"

is *nothing* but

"mce: Action required: data load in error recoverable area of kernel"

helps.
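
The idea of taking the message from the severities can be sketched as a
first-match table lookup. This Python sketch is a simplification under stated
assumptions, not the kernel's severity table: the rule layout, the context
names and the second rule are hypothetical, and only the first message string
is taken from this thread.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    mask: int               # status bits this rule looks at
    result: int             # value those bits must have
    context: Optional[str]  # required execution context, None matches any
    msg: str                # text to log when the rule matches


UC, S, AR = 1 << 61, 1 << 56, 1 << 55  # Intel IA32_MCi_STATUS bits (assumed layout)
MCACOD = 0xFFFF                        # low 16 bits: MCA error code
DATA_LOAD = 0x0134                     # data-read error code (assumed value)

RULES = [
    Rule(UC | S | AR | MCACOD, UC | S | AR | DATA_LOAD, "kernel_recov",
         "Action required: data load in error recoverable area of kernel"),
    Rule(UC, 0, None, "Corrected error"),  # placeholder rule for illustration
]


def describe(status, context):
    """Return the message of the first rule matching both status and context."""
    for rule in RULES:
        if (status & rule.mask) == rule.result and rule.context in (None, context):
            return rule.msg
    return "no matching rule"
```

With such a table the logged text names both the error class and the context
in which it was consumed, instead of only the physical address.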

Thanks for your time.
Shuai