Re: [PATCH v2 2/5] x86/mce: dump error msg from severities

On Mon, Mar 03, 2025 at 04:49:25PM +0000, Luck, Tony wrote:
> > The error context is in the behavior of the hw. If the error is fatal, you
> > won't see it - the machine will panic or do something else to prevent error
> > propagation. It definitely won't run any software anymore.
> >
> > If you see the error getting logged, it means it is not fatal enough to kill
> > the machine.
> 
> One place in the fatal case where I would like to see more information is the
> 
>   "Action required: data load in error *UN*recoverable area of kernel"
> 
> [emphasis on the "UN" added].
> 
> case.  We have a few places where the kernel does recover. And most places
> we crash. Our code for the recoverable cases is fragile. Most of this series is
> about repairing regressions where we used to recover from places where kernel
> is doing get_user() or copy_from_user() which can be recovered if those places
> get an error return and the kernel kills the process instead of crashing.
> 
> A long time ago I posted some patches to include a stack trace for this type
> of crash. It didn't make it into the kernel, and I got distracted by other things.
> 
> If we had that, it would have been easier to diagnose this regression (Shaui
> Xie would have seen crashes with a stack trace pointing to code that used
> to recover in older kernels). Folks with big clusters would also be able to
> point out other places where the kernel crashes often enough that additional
> EXTABLE recovery paths would be worth investigating.
> 
> So:
> 
> 1) We need to fix the regressions. That just needs new commit messages
> for these patches that explain the issue better.
> 
> 2) I'd like to see a patch for a stack trace for the unrecoverable case.
> 
> 3) I don't see much value in a message that reports the recoverable case.
> 
> Yazen: At one point I think you said you were looking at adding additional
> decorations to the return value from mce_severity() to indicate actions
> needed for recoverable errors (kill the process, offline the page) rather
> than have do_machine_check() figure it out by looking at various fields
> in the "struct mce". Did that go anywhere? Those extra details might be
> interesting in the tracepoint.
> 

Hi Tony,

Yes, I have a patch here:
https://github.com/AMDESE/linux/commit/cf0b8a97240abf0fbd98a91cd8deb262f827721b

Branch:
https://github.com/AMDESE/linux/commits/wip-mca/

This work is at the tail-end of a lot of other refactoring. But it can
be prioritized if there's interest. Most of the dependencies have
already been merged.
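
For readers following the thread, the idea Tony describes above (a severity
return value that also carries the recovery actions) could look roughly like
the userspace sketch below. It is an illustration only, not the code in the
linked commit: every name in it (sketch_severity(), SKETCH_SEV_*,
SKETCH_ACT_*) and the bit layout are assumptions made for this example, and
the decision logic is heavily simplified.

```c
#include <stdbool.h>

/* Hypothetical severity levels; not the kernel's definitions. */
enum sketch_sev {
	SKETCH_SEV_KEEP  = 0,	/* log only, nothing to recover */
	SKETCH_SEV_AR    = 1,	/* action required, recoverable */
	SKETCH_SEV_PANIC = 2,	/* unrecoverable, machine must stop */
};

/* Hypothetical layout: low bits hold the severity, high bits the actions. */
#define SKETCH_SEV_MASK		0x0fu
#define SKETCH_ACT_KILL_TASK	0x10u	/* kill the affected process */
#define SKETCH_ACT_OFFLINE_PAGE	0x20u	/* take the poisoned page offline */

/*
 * Grade an error and "decorate" the result with the actions the caller
 * should take, so the caller does not have to re-derive them from the
 * raw error record.
 */
static unsigned int sketch_severity(bool consumed, bool in_kernel,
				    bool has_fixup, bool addr_valid)
{
	unsigned int ret;

	/* Poison was not consumed: only log it. */
	if (!consumed)
		return SKETCH_SEV_KEEP;

	/* Kernel consumed poison with no exception-table fixup: give up. */
	if (in_kernel && !has_fixup)
		return SKETCH_SEV_PANIC;

	/* User access, or a kernel access that can return an error. */
	ret = SKETCH_SEV_AR | SKETCH_ACT_KILL_TASK;
	if (addr_valid)
		ret |= SKETCH_ACT_OFFLINE_PAGE;

	return ret;
}
```

A caller would mask with SKETCH_SEV_MASK to get the severity and test the
SKETCH_ACT_* bits to decide what to do next; the same bits could then be
reported as-is in a tracepoint.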

Thanks,
Yazen



