[Sorry if breaking threading] Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote: > On Mon, Aug 05, 2024 at 09:37:48PM -0700, Kees Cook wrote: > > On Tue, Jul 23, 2024 at 04:47:52PM +0200, Andrew Zaborowski wrote: > > > Uncorrected memory errors for user pages are signaled to processes > > > using SIGBUS or, if the error happens in a syscall, an error retval > > > from the syscall. The SIGBUS is documented in > > > Documentation/mm/hwpoison.rst#failure-recovery-modes > > > > > > Once a user task sets t->rseq in the rseq() syscall, if the kernel > > > cannot access the memory pointed to by t->rseq->rseq_cs, that initial > > > rseq() and all future syscalls should return an error so understandably > > > the code just kills the task. > > > > > > To ensure that SIGBUS is used set the new t->kill_on_efault flag and > > > run queued task work on rseq_get_rseq_cs() errors to give memory_failure > > > the chance to run. > > > > > > Note: the rseq checks run inside resume_user_mode_work() so whenever > > > _TIF_NOTIFY_RESUME is set. They do not run on every syscall exit so > > > I'm not concerned that these extra flag operations are in a hot path, > > > except with CONFIG_DEBUG_RSEQ. > > > > > > Signed-off-by: Andrew Zaborowski <andrew.zaborowski@xxxxxxxxx> > > > --- > > > kernel/rseq.c | 25 +++++++++++++++++++++---- > > > > Can an rseq maintainer please review this? I can carry it via the execve > > tree with the related patches... > > *sigh*,.. because get_maintainers just doesn't work or something? Oops, I should have re-run it on the v2. > > Anyway, I'm confused by the signal code (as always), why isn't the > task_work_run() in get_signal() sufficient? Ok, good point, I only considered the task_work_run() call inside resume_user_mode_work() which was after the signals were already delivered. So the reason it's not sufficient seems to be that, given a SIGSEGV and a SIGBUS, dequeue_synchronous_signal() returns the one that was queued first i.e. the SIGSEGV. dequeue_signal() would have returned the one with lower value, which would have been the SIGBUS. > > At some point we're going to run into trouble with sprinkling > task_work_run() around willy nilly :/ Right, this isn't ideal. get_signal has this comment: /* * Signals generated by the execution of an instruction * need to be delivered before any other pending signals * so that the instruction pointer in the signal stack * frame points to the faulting instruction. */ ... signr = dequeue_synchronous_signal(&ksig->info); if (!signr) signr = dequeue_signal(current, ¤t->blocked, &ksig->info, &type); In this case the two signals will have the same userspace IP. The SIGBUS will have a non-NULL .si_addr and is more specific. I wonder if dequeue_synchronous_signal should have extra logic to prefer the SIGBUS, or if memory_failure() should try to put the SIGBUS at the top of current->pending.list. In any case the code would need to be careful when doing this and I can't think of the exact conditions to be checked right now. Best regards