Re: [RESEND][PATCH 3/3] rseq: Ensure SIGBUS delivered on memory failure

Andrew Zaborowski <andrew.zaborowski@xxxxxxxxx> · Thu, 8 Aug 2024 04:33:31 +0200

[Sorry if breaking threading]

Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Mon, Aug 05, 2024 at 09:37:48PM -0700, Kees Cook wrote:
> > On Tue, Jul 23, 2024 at 04:47:52PM +0200, Andrew Zaborowski wrote:
> > > Uncorrected memory errors for user pages are signaled to processes
> > > using SIGBUS or, if the error happens in a syscall, an error retval
> > > from the syscall.  The SIGBUS is documented in
> > > Documentation/mm/hwpoison.rst#failure-recovery-modes
> > >
> > > Once a user task sets t->rseq in the rseq() syscall, if the kernel
> > > cannot access the memory pointed to by t->rseq->rseq_cs, that initial
> > > rseq() and all future syscalls should return an error so understandably
> > > the code just kills the task.
> > >
> > > To ensure that SIGBUS is used set the new t->kill_on_efault flag and
> > > run queued task work on rseq_get_rseq_cs() errors to give memory_failure
> > > the chance to run.
> > >
> > > Note: the rseq checks run inside resume_user_mode_work() so whenever
> > > _TIF_NOTIFY_RESUME is set.  They do not run on every syscall exit so
> > > I'm not concerned that these extra flag operations are in a hot path,
> > > except with CONFIG_DEBUG_RSEQ.
> > >
> > > Signed-off-by: Andrew Zaborowski <andrew.zaborowski@xxxxxxxxx>
> > > ---
> > >  kernel/rseq.c | 25 +++++++++++++++++++++----
> >
> > Can an rseq maintainer please review this? I can carry it via the execve
> > tree with the related patches...
>
> *sigh*,.. because get_maintainers just doesn't work or something?

Oops, I should have re-run it on the v2.

>
> Anyway, I'm confused by the signal code (as always), why isn't the
> task_work_run() in get_signal() sufficient?

Ok, good point, I only considered the task_work_run() call inside
resume_user_mode_work() which was after the signals were already
delivered.

So the reason it's not sufficient seems to be that, given a SIGSEGV
and a SIGBUS, dequeue_synchronous_signal() returns the one that was
queued first i.e. the SIGSEGV.

dequeue_signal() would have returned the one with lower value, which
would have been the SIGBUS.

>
> At some point we're going to run into trouble with sprinkling
> task_work_run() around willy nilly :/

Right, this isn't ideal.

get_signal has this comment:

/*
 * Signals generated by the execution of an instruction
 * need to be delivered before any other pending signals
 * so that the instruction pointer in the signal stack
 * frame points to the faulting instruction.
 */
...
signr = dequeue_synchronous_signal(&ksig->info);
if (!signr)
    signr = dequeue_signal(current, &current->blocked, &ksig->info, &type);

In this case the two signals will have the same userspace IP.  The
SIGBUS will have a non-NULL .si_addr and is more specific.

I wonder if dequeue_synchronous_signal should have extra logic to
prefer the SIGBUS, or if memory_failure() should try to put the SIGBUS
at the top of current->pending.list.  In any case the code would need
to be careful when doing this and I can't think of the exact
conditions to be checked right now.

Best regards