RE: [PATCH v2 2/5] x86/mce: dump error msg from severities


 



> The error context is in the behavior of the hw. If the error is fatal, you
> won't see it - the machine will panic or do something else to prevent error
> propagation. It definitely won't run any software anymore.
>
> If you see the error getting logged, it means it is not fatal enough to kill
> the machine.

In the fatal case, one place where I would like to see more information is the

  "Action required: data load in error *UN*recoverable area of kernel"

case [emphasis on the "UN" added]. There are a few places where the kernel
does recover, but in most places we crash. Our code for the recoverable cases
is fragile. Most of this series is about repairing regressions: we used to
recover when the kernel hit the error while doing get_user() or
copy_from_user(). Those cases are recoverable as long as the access gets an
error return and the kernel kills the process instead of crashing.
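
To make the "error return" point concrete, here is a rough user-space model of
that flow. It is only a sketch: the sketch_* names and the poison_off parameter
are invented for illustration and do not exist in the kernel. The one real
convention it borrows is that copy_from_user() returns the number of bytes it
could not copy, so a caller that sees a nonzero result can fail with -EFAULT
rather than crash.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a user copy (not kernel code). Source bytes
 * at offset poison_off and beyond are treated as poisoned memory. Like
 * copy_from_user(), it returns the number of bytes NOT copied, so 0
 * means the whole copy succeeded.
 */
static size_t sketch_copy_from_user(void *dst, const void *src, size_t n,
                                    size_t poison_off)
{
    size_t ok = n < poison_off ? n : poison_off;

    memcpy(dst, src, ok);   /* copy only the healthy prefix */
    return n - ok;          /* remainder was not copied */
}

/*
 * Hypothetical caller: a short copy becomes an error return to the
 * caller instead of a crash.
 */
static int sketch_read_user_arg(void *dst, const void *src, size_t n,
                                size_t poison_off)
{
    if (sketch_copy_from_user(dst, src, n, poison_off))
        return -EFAULT;
    return 0;
}
```

In this model the decision to kill the process is left out; the sketch only
shows the error-return half that the caller sees.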

A long time ago I posted some patches to include a stack trace for this type
of crash. They didn't make it into the kernel, and I got distracted by other things.

If we had that, it would have been easier to diagnose this regression (Shuai
Xue would have seen crashes with a stack trace pointing to code that used
to recover in older kernels). Folks with big clusters would also be able to
point out other places where the kernel crashes often enough that additional
EXTABLE recovery paths would be worth investigating.

So:

1) We need to fix the regressions. That just needs new commit messages
for these patches that explain the issue better.

2) I'd like to see a patch for a stack trace for the unrecoverable case.

3) I don't see much value in a message that reports the recoverable case.

Yazen: At one point I think you said you were looking at adding additional
decorations to the return value from mce_severity() to indicate actions
needed for recoverable errors (kill the process, offline the page) rather
than have do_machine_check() figure it out by looking at various fields
in the "struct mce". Did that go anywhere? Those extra details might be
interesting in the tracepoint.
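
As a strawman for what such a decorated return value could look like, here is a
small sketch. Everything in it is hypothetical: the sketch_* types, the action
flags, and the toy classification rule are made up for illustration and are not
the current mce_severity() interface or anyone's posted patch. The idea is just
that the severity code returns the needed actions together with the severity,
so the caller does not re-derive them from fields in the "struct mce".

```c
#include <stdbool.h>

/* Hypothetical severity levels, not the kernel's enum. */
enum sketch_severity { SKETCH_SEV_KEEP, SKETCH_SEV_AR, SKETCH_SEV_PANIC };

/* Hypothetical action "decorations" returned alongside the severity. */
#define SKETCH_ACT_KILL_TASK    (1u << 0)
#define SKETCH_ACT_OFFLINE_PAGE (1u << 1)

struct sketch_verdict {
    enum sketch_severity sev;
    unsigned int actions;
};

/* Toy inputs standing in for what would be decoded from "struct mce". */
struct sketch_error {
    bool action_required;  /* error needs immediate action */
    bool in_kernel;        /* error was taken in kernel context */
    bool has_fixup;        /* faulting code has a recovery path */
    bool addr_valid;       /* a usable physical address was logged */
};

static struct sketch_verdict sketch_classify(const struct sketch_error *e)
{
    struct sketch_verdict v = { SKETCH_SEV_KEEP, 0 };

    if (!e->action_required)
        return v;                       /* just log it */

    if (e->in_kernel && !e->has_fixup) {
        v.sev = SKETCH_SEV_PANIC;       /* the *UN*recoverable case */
        return v;
    }

    v.sev = SKETCH_SEV_AR;
    v.actions |= SKETCH_ACT_KILL_TASK;
    if (e->addr_valid)
        v.actions |= SKETCH_ACT_OFFLINE_PAGE;
    return v;
}
```

With something of this shape, the tracepoint could report v.actions directly.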

-Tony




