On Wed, May 01, 2024 at 03:53:40AM +0200, Andrew Zaborowski wrote:
> Uncorrected memory errors are signaled to processes using SIGBUS or an
> error retval from a syscall. But there's a corner case in execve where
> a SIGSEGV will be delivered. Specifically this will happen if the binary
> loader triggers a memory error reading user pages. The architecture's
> handler (MCE handler on x86) may queue a call to memory_failure but that
> won't run until the execve() returns. The binary loader is called after
> bprm->point_of_no_return is set, meaning that any error is handled by
> bprm_execve() with a SIGSEGV to the process.

Why is the distinction between SIGBUS and SIGSEGV needed in this case?

> To ensure it is terminated with a SIGBUS we 1. let pending work run in
> the bprm_execve error case.
>
> And 2. ensure memory_failure() is passed MF_ACTION_REQUIRED so that the
> SIGBUS is queued. Normally, when the MCE is in a syscall, a fixup of
> the return IP and a call to kill_me_never are enough. But in this case
> it's necessary to queue kill_me_maybe(), which will set
> MF_ACTION_REQUIRED.
>
> Reuse current->in_execve to make the decision.

We're actually in the process of trying to remove[1] this flag, so I'd
like to avoid adding new users of it. It sounds like it's wanted here
because a choice is needed between kill_me_never() and kill_me_maybe()?

-Kees

[1] https://lore.kernel.org/lkml/8fafb8e1-b6be-4d08-945f-b464e3a396c8@xxxxxxxxxxxxxxxxxxx/

> 
> Signed-off-by: Andrew Zaborowski <andrew.zaborowski@xxxxxxxxx>
> ---
>  arch/x86/kernel/cpu/mce/core.c | 14 ++++++++++++++
>  fs/exec.c                      | 12 +++++++++---
>  include/linux/sched.h          |  2 +-
>  3 files changed, 24 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 84d41be6d06b..11effdff942c 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -1593,6 +1593,20 @@ noinstr void do_machine_check(struct pt_regs *regs)
>  		else
>  			queue_task_work(&m, msg, kill_me_maybe);
>  
> +	} else if (current->in_execve) {
> +		/*
> +		 * Cannot recover a task in execve() beyond point of no
> +		 * return but stop further user memory accesses.
> +		 */
> +		if (m.kflags & MCE_IN_KERNEL_RECOV) {
> +			if (!fixup_exception(regs, X86_TRAP_MC, 0, 0))
> +				mce_panic("Failed kernel mode recovery", &m, msg);
> +		}
> +
> +		if (!mce_usable_address(&m))
> +			queue_task_work(&m, msg, kill_me_now);
> +		else
> +			queue_task_work(&m, msg, kill_me_maybe);
>  	} else {
>  		/*
>  		 * Handle an MCE which has happened in kernel space but from
> diff --git a/fs/exec.c b/fs/exec.c
> index cf1df7f16e55..1bea9c252a11 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -67,6 +67,7 @@
>  #include <linux/time_namespace.h>
>  #include <linux/user_events.h>
>  #include <linux/rseq.h>
> +#include <linux/task_work.h>
>  
>  #include <linux/uaccess.h>
>  #include <asm/mmu_context.h>
> @@ -1888,10 +1889,15 @@ static int bprm_execve(struct linux_binprm *bprm)
>  	 * If past the point of no return ensure the code never
>  	 * returns to the userspace process. Use an existing fatal
>  	 * signal if present otherwise terminate the process with
> -	 * SIGSEGV.
> +	 * SIGSEGV. Run pending work before that in case it is
> +	 * terminating the process with a different signal.
>  	 */
> -	if (bprm->point_of_no_return && !fatal_signal_pending(current))
> -		force_fatal_sig(SIGSEGV);
> +	if (bprm->point_of_no_return) {
> +		task_work_run();
> +
> +		if (!fatal_signal_pending(current))
> +			force_fatal_sig(SIGSEGV);
> +	}
>  
>  	sched_mm_cid_after_execve(current);
>  	current->fs->in_exec = 0;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 3c2abbc587b4..8970a191d8fe 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -922,7 +922,7 @@ struct task_struct {
>  	unsigned			sched_rt_mutex:1;
>  #endif
>  
> -	/* Bit to tell TOMOYO we're in execve(): */
> +	/* Bit to tell TOMOYO and x86 MCE code we're in execve(): */
>  	unsigned			in_execve:1;
>  	unsigned			in_iowait:1;
>  #ifndef TIF_RESTORE_SIGMASK
> -- 
> 2.39.3
> 

-- 
Kees Cook
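
For completeness, here is a minimal userspace sketch of why the
terminating signal matters to a supervising parent, since that question
came up above. It is hypothetical and not part of the patch or this
thread: the "./worker" path and the reactions in the comments are made
up for illustration. A parent can tell a child killed by SIGBUS
(an uncorrected memory error reported via memory_failure()) apart from
one killed by SIGSEGV (an ordinary crash) only if the kernel actually
delivers SIGBUS:

/*
 * Hypothetical monitor: fork/exec a worker and classify how it died.
 * SIGBUS is treated as a hardware memory error, SIGSEGV as a plain
 * software crash.
 */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	pid_t pid = fork();

	if (pid < 0) {
		perror("fork");
		return 1;
	}

	if (pid == 0) {
		/* "./worker" is a made-up child binary for this example. */
		execl("./worker", "worker", (char *)NULL);
		perror("execl");	/* reached only if execve() itself failed */
		_exit(127);
	}

	int status;

	if (waitpid(pid, &status, 0) < 0) {
		perror("waitpid");
		return 1;
	}

	if (WIFSIGNALED(status)) {
		int sig = WTERMSIG(status);

		if (sig == SIGBUS)
			/* Memory error: e.g. flag the host for memory diagnostics. */
			printf("worker killed by SIGBUS (memory error)\n");
		else if (sig == SIGSEGV)
			/* Ordinary crash: e.g. collect a core dump, file a bug. */
			printf("worker killed by SIGSEGV (crash)\n");
		else
			printf("worker killed by signal %d (%s)\n", sig, strsignal(sig));
	} else if (WIFEXITED(status)) {
		printf("worker exited with status %d\n", WEXITSTATUS(status));
	}

	return 0;
}

A monitor along these lines could schedule memory tests or page
offlining when it sees SIGBUS while treating SIGSEGV as a software bug,
which is only meaningful if the kernel reports the memory error as
SIGBUS in the first place.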