On Fri, Feb 07, 2025 at 04:27:09PM +0100, Jann Horn wrote:
> On Sun, Feb 2, 2025 at 5:29 PM Eyal Birger <eyal.birger@xxxxxxxxx> wrote:
> > uretprobe(2) is an performance enhancement system call added to improve
> > uretprobes on x86_64.
> >
> > Confinement environments such as Docker are not aware of this new system
> > call and kill confined processes when uretprobes are attached to them.
>
> FYI, you might have similar issues with Syscall User Dispatch
> (https://docs.kernel.org/admin-guide/syscall-user-dispatch.html) and
> potentially also with ptrace-based sandboxes, depending on what kinda
> processes you inject uprobes into. For Syscall User Dispatch, there is
> already precedent for a bypass based on instruction pointer (see
> syscall_user_dispatch()).
>
> > Since uretprobe is a "kernel implementation detail" system call which is
> > not used by userspace application code directly, pass this system call
> > through seccomp without forcing existing userspace confinement environments
> > to be changed.
>
> This makes me feel kinda uncomfortable. The purpose of seccomp() is
> that you can create a process that is as locked down as you want; you
> can use it for some light limits on what a process can do (like in
> Docker), or you can use it to make a process that has access to
> essentially nothing except read(), write() and exit_group(). Even
> stuff like restart_syscall() and rt_sigreturn() is not currently
> excepted from that.
>
> I guess your usecase is a little special in that you were already
> calling from userspace into the kernel with SWBP before, which is also
> not subject to seccomp; and the syscall is essentially an
> arch-specific hack to make the SWBP a little faster.
>
> If we do this, we should at least ensure that there is absolutely no
> way for anything to happen in sys_uretprobe when no uretprobes are
> configured for the process - the first check in the syscall
> implementation almost does that, but the implementation could be a bit
> stricter. It checks for "regs->ip != trampoline_check_ip()", but if no
> uprobe region exists for the process, trampoline_check_ip() returns
> `-1 + (uretprobe_syscall_check - uretprobe_trampoline_entry)`. So
> there is a userspace instruction pointer near the bottom of the
> address space that is allowed to call into the syscall if uretprobes
> are not set up. Though the mmap minimum address restrictions will
> typically prevent creating mappings there, and
> uprobe_handle_trampoline() will SIGILL us if we get that far without a
> valid uretprobe.

nice catch, I think change below should fix that

thanks,
jirka


---
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 0c74a4d4df65..9b8837d8f06e 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -368,19 +368,21 @@ void *arch_uretprobe_trampoline(unsigned long *psize)
 	return &insn;
 }
 
-static unsigned long trampoline_check_ip(void)
+static unsigned long trampoline_check_ip(unsigned long tramp)
 {
-	unsigned long tramp = uprobe_get_trampoline_vaddr();
-
 	return tramp + (uretprobe_syscall_check - uretprobe_trampoline_entry);
 }
 
 SYSCALL_DEFINE0(uretprobe)
 {
 	struct pt_regs *regs = task_pt_regs(current);
-	unsigned long err, ip, sp, r11_cx_ax[3];
+	unsigned long err, ip, sp, r11_cx_ax[3], tramp;
+
+	tramp = uprobe_get_trampoline_vaddr();
+	if (tramp == -1)
+		goto sigill;
 
-	if (regs->ip != trampoline_check_ip())
+	if (regs->ip != trampoline_check_ip(tramp))
 		goto sigill;
 
 	err = copy_from_user(r11_cx_ax, (void __user *)regs->sp, sizeof(r11_cx_ax));
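
For illustration only, here is a small userspace sketch of the wraparound
discussed above. It is not kernel code. CHECK_OFFSET is a made-up stand-in
for (uretprobe_syscall_check - uretprobe_trampoline_entry), and the helper
names gate_old()/gate_new() are hypothetical. The only thing taken from the
thread is that the "no uprobe region" case shows up as -1 in an unsigned
long, so adding a small offset wraps around to a low address.

```c
#include <assert.h>

/* Assumption: "no trampoline mapped" is reported as (unsigned long)-1,
 * matching the `tramp == -1` check in the patch. */
#define NO_TRAMPOLINE ((unsigned long)-1)

/* Hypothetical value standing in for
 * (uretprobe_syscall_check - uretprobe_trampoline_entry). */
#define CHECK_OFFSET 0x10UL

/* Old logic: compare ip against tramp + offset with no sentinel check.
 * Returns 1 if the caller would get past the first check, 0 otherwise. */
static int gate_old(unsigned long ip, unsigned long tramp)
{
	return ip == tramp + CHECK_OFFSET;
}

/* Patched logic: reject outright when no trampoline exists. */
static int gate_new(unsigned long ip, unsigned long tramp)
{
	if (tramp == NO_TRAMPOLINE)
		return 0;
	return ip == tramp + CHECK_OFFSET;
}

/* Self-check of the two cases discussed in the thread. */
void uretprobe_gate_self_check(void)
{
	/* Unsigned arithmetic wraps: -1 + CHECK_OFFSET == CHECK_OFFSET - 1. */
	unsigned long low_ip = CHECK_OFFSET - 1;

	assert(NO_TRAMPOLINE + CHECK_OFFSET == low_ip);
	assert(gate_old(low_ip, NO_TRAMPOLINE) == 1); /* passes old check */
	assert(gate_new(low_ip, NO_TRAMPOLINE) == 0); /* rejected by patch */
}
```

With a real trampoline address both variants behave the same; they differ
only in the sentinel case, which is the gap the early `goto sigill` closes.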