Re: [RESEND][PATCH 1/3] x86: Add task_struct flag to force SIGBUS on MCE

Borislav Petkov <bp@xxxxxxxxx> · Sat, 10 Aug 2024 05:21:49 +0200

On Sat, Aug 10, 2024 at 03:20:10AM +0200, Andrew Zaborowski wrote:
> True, though that's hard to link to a specific process crash.

The process name when the MCE gets reported is *actually* there in the
splat: current->comm.

> I was replying to your comment about the size of the change.

My comment was about the code you're adding:

 arch/x86/kernel/cpu/mce/core.c | 18 +++++++++++++++++-
 fs/exec.c                      | 15 ++++++++++++---
 include/linux/sched.h          |  2 ++
 kernel/rseq.c                  | 25 +++++++++++++++++++++----
 4 files changed, 52 insertions(+), 8 deletions(-)

If it is in drivers, I don't care. But it is in main paths and for
a questionable use case.

> Supporting something generally includes supporting the common and the
> obscure cases.

Bullshit. We certainly won't support some obscure use cases just
because.

> From the user's point of view the kernel has been committed to
> supporting these scenarios indefinitely or until the deprecation of
> the SIGBUS-on-memory-error logic, and simply has a bug.

And lemme repeat my question:

So why does it matter whether a process which is being executed and
takes an MCE beyond the point of no return absolutely has to receive
SIGBUS, versus simply getting killed, when an MCE gets logged on the
machine in either case?

Bug which is seen by whom or what?

If a process dies, it dies.

> In these tests the workload was simply relaunched on a SIGBUS which
> sounded fair to me.  A qemu VM could similarly be restarted on an
> unrecoverable MCE in a page that doesn't belong to the VM but to qemu
> itself.

If an MCE hits at that particular point once in a blue moon, I don't
care. If it is a special use case where you inject an MCE right then and
there to test recovery actions, then that's perhaps a different story.
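
For illustration, the relaunch-on-SIGBUS setup from the quoted text can
be sketched as a small userspace supervisor. This is a minimal sketch
and not code from the patch: the enum, the function names and the demo
workloads are all hypothetical, and the demo workloads only simulate
the two outcomes with raise(). Note that a wait status cannot tell an
MCE-triggered SIGBUS from any other SIGBUS; only a handler inside the
process sees si_code (BUS_MCEERR_AR or BUS_MCEERR_AO).

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical policy names; not taken from the patch or the kernel. */
enum child_fate {
	FATE_CLEAN_EXIT,	/* exited normally with status 0 */
	FATE_SIGBUS,		/* terminated by SIGBUS, treated as a memory error */
	FATE_OTHER_FAILURE	/* non-zero exit, SIGKILL, SIGSEGV, ... */
};

/* Map a waitpid() status onto the policy above. */
static enum child_fate classify_wait_status(int status)
{
	if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
		return FATE_CLEAN_EXIT;
	if (WIFSIGNALED(status) && WTERMSIG(status) == SIGBUS)
		return FATE_SIGBUS;
	return FATE_OTHER_FAILURE;
}

/* Fork, run workload() in the child, and report how the child ended. */
static enum child_fate run_workload_once(int (*workload)(void))
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return FATE_OTHER_FAILURE;
	if (pid == 0)
		_exit(workload());
	if (waitpid(pid, &status, 0) < 0)
		return FATE_OTHER_FAILURE;
	return classify_wait_status(status);
}

/* Relaunch only after SIGBUS, at most max_restarts times. */
static enum child_fate supervise(int (*workload)(void), int max_restarts)
{
	enum child_fate fate = run_workload_once(workload);

	while (fate == FATE_SIGBUS && max_restarts-- > 0)
		fate = run_workload_once(workload);
	return fate;
}

/* Demo stand-ins for a real workload (assumptions for illustration). */
static int workload_ok(void)     { return 0; }
static int workload_sigbus(void) { raise(SIGBUS);  return 1; }
static int workload_killed(void) { raise(SIGKILL); return 1; }
```

With such a supervisor, a child that dies from SIGBUS is restarted while
a child that is SIGKILLed is treated as a generic failure, which is the
only place where the two outcomes being compared in this thread differ
from userspace's point of view.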

Usually, a lot of things can be done, as long as there's a valid use
case to support. But since you hesitate to explain what exactly we're
supporting, I can't help you.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette