On Mon, Mar 03, 2025 at 04:49:25PM +0000, Luck, Tony wrote:
> > The error context is in the behavior of the hw. If the error is fatal, you
> > won't see it - the machine will panic or do something else to prevent error
> > propagation. It definitely won't run any software anymore.
> >
> > If you see the error getting logged, it means it is not fatal enough to kill
> > the machine.
>
> One place in the fatal case where I would like to see more information is the
>
>   "Action required: data load in error *UN*recoverable area of kernel"
>
> [emphasis on the "UN" added].
>
> case. We have a few places where the kernel does recover. And most places
> we crash. Our code for the recoverable cases is fragile. Most of this series is
> about repairing regressions where we used to recover from places where kernel
> is doing get_user() or copy_from_user() which can be recovered if those places
> get an error return and the kernel kills the process instead of crashing.
>
> A long time ago I posted some patches to include a stack trace for this type
> of crash. It didn't make it into the kernel, and I got distracted by other things.
>
> If we had that, it would have been easier to diagnose this regression (Shaui
> Xie would have seen crashes with a stack trace pointing to code that used
> to recover in older kernels). Folks with big clusters would also be able to
> point out other places where the kernel crashes often enough that additional
> EXTABLE recovery paths would be worth investigating.
>
> So:
>
> 1) We need to fix the regressions. That just needs new commit messages
> for these patches that explain the issue better.
>
> 2) I'd like to see a patch for a stack trace for the unrecoverable case.
>
> 3) I don't see much value in a message that reports the recoverable case.
>
> Yazen: At one point I think you said you were looking at adding additional
> decorations to the return value from mce_severity() to indicate actions
> needed for recoverable errors (kill the process, offline the page) rather
> than have do_machine_check() figure it out by looking at various fields
> in the "struct mce". Did that go anywhere? Those extra details might be
> interesting in the tracepoint.
>

Hi Tony,

Yes, I have a patch here:
https://github.com/AMDESE/linux/commit/cf0b8a97240abf0fbd98a91cd8deb262f827721b

Branch: https://github.com/AMDESE/linux/commits/wip-mca/

This work is at the tail-end of a lot of other refactoring. But it can be
prioritized if there's interest. Most of the dependencies have already been
merged.

Thanks,
Yazen