Re: [RESEND][PATCH 1/3] x86: Add task_struct flag to force SIGBUS on MCE

Andrew Zaborowski <andrew.zaborowski@xxxxxxxxx> · Sat, 10 Aug 2024 03:20:10 +0200

Borislav Petkov <bp@xxxxxxxxx> wrote:
> So instead of the process getting killed, you want to return SIGBUS
> because, "hey caller, your process encountered an MCE while being
> attempted to be executed"?

The tests could be changed to expect the SIGSEGV but in this case it
seemed that the test was good and the kernel was misbehaving.  One of
the authors of the MCE handling code confirmed that.

>
> > Qemu relies on the SIGBUS logic but the execve and rseq
> > cases cannot be recovered from, the main benefit of sending the
> > correct signal is perhaps information to the user.
>
> You will have that info in the logs - we're usually very loud when we
> get an MCE...

True, though that's hard to link to a specific process crash.  It's
also hard to extract the page address in the process's address space
from that, although I don't think there's a current use case.
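
For illustration, a minimal userspace sketch of what the signal carries
that the logs do not.  It assumes the documented Linux siginfo fields for
memory-error SIGBUS (si_code of BUS_MCEERR_AR or BUS_MCEERR_AO, si_addr,
si_addr_lsb).  The names struct mce_fault, decode_mce_sigbus() and
install_sigbus_handler() are hypothetical and are not part of this patch.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical container for what a memory-error SIGBUS reports. */
struct mce_fault {
    void  *addr;            /* faulting address in this process */
    size_t len;             /* size of the affected range, from si_addr_lsb */
    bool   action_required; /* true for BUS_MCEERR_AR, false for BUS_MCEERR_AO */
};

/*
 * Fill *out from a siginfo_t if it describes a memory-error SIGBUS.
 * Returns false for any other signal or any other SIGBUS si_code.
 */
bool decode_mce_sigbus(const siginfo_t *si, struct mce_fault *out)
{
    if (si->si_signo != SIGBUS)
        return false;
    if (si->si_code != BUS_MCEERR_AR && si->si_code != BUS_MCEERR_AO)
        return false;

    out->addr = si->si_addr;
    out->len = (size_t)1 << si->si_addr_lsb;
    out->action_required = (si->si_code == BUS_MCEERR_AR);
    return true;
}

/* Install fn as the SIGBUS handler with SA_SIGINFO so it gets a siginfo_t. */
int install_sigbus_handler(void (*fn)(int, siginfo_t *, void *))
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = fn;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}
```

A handler installed this way would call decode_mce_sigbus() on its
siginfo_t argument.  Simply returning from a BUS_MCEERR_AR handler would
normally re-execute the faulting access, so a real handler has to unmap,
replace or otherwise avoid the page, or exit.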

>
> > If this cannot be fixed then optimally it should be documented.
>
> I'm not convinced at all that jumping through hoops you're doing, is
> worth the effort.

That could be; again, this could be fixed in the documentation instead.

>
> > As for "all that code", the memory failure handling code is of certain
> > size and this is a comparatively tiny fix for a tiny issue.
>
> No, I didn't say anything about the memory failure code - it is about

I was replying to your comment about the size of the change.

> supporting that obscure use case and the additional logic you're adding
> to the #MC handler which looks like a real mess already and us having to
> support that use case indefinitely.

Supporting something generally includes supporting the common and the
obscure cases.  From the user's point of view the kernel has been
committed to supporting these scenarios indefinitely or until the
deprecation of the SIGBUS-on-memory-error logic, and simply has a bug.

>
> So why does it matter if a process which is being executed and gets an
> MCE beyond the point of no return absolutely needs to return SIGBUS vs
> it getting killed and you still get an MCE logged on the machine, in
> either case?

A SIGSEGV strongly implies a problem with the program being run, not
with a specific instance of it.  A SIGBUS may not be the program's
fault, as in this case.

In these tests the workload was simply relaunched on a SIGBUS, which
sounded fair to me.  A qemu VM could similarly be restarted on an
unrecoverable MCE in a page that doesn't belong to the VM but to qemu
itself.
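
A rough sketch of that kind of relaunch policy, assuming a plain
fork/exec supervisor.  The names should_relaunch() and
run_with_relaunch() and the max_tries limit are hypothetical; this is
not the actual test harness.

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical policy: a child killed by SIGBUS is treated as a victim
 * of a transient fault and relaunched.  Any other outcome, SIGSEGV
 * included, is treated as final.
 */
bool should_relaunch(int wstatus)
{
    return WIFSIGNALED(wstatus) && WTERMSIG(wstatus) == SIGBUS;
}

/*
 * Run argv[0] at most max_tries times, relaunching only per the policy
 * above.  Returns the last wait status, or -1 if fork() or waitpid()
 * failed (or max_tries was not positive).
 */
int run_with_relaunch(char *const argv[], int max_tries)
{
    int wstatus = -1;

    for (int i = 0; i < max_tries; i++) {
        pid_t pid = fork();

        if (pid < 0)
            return -1;
        if (pid == 0) {
            execvp(argv[0], argv);
            _exit(127);     /* exec failed */
        }
        if (waitpid(pid, &wstatus, 0) < 0)
            return -1;
        if (!should_relaunch(wstatus))
            break;
    }
    return wstatus;
}
```

With a policy like this, a SIGSEGV delivered for what was really a
memory error makes the supervisor give up on a workload that would have
run fine on a second attempt.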

Best regards