Gabriel,

Gabriel Krisman Bertazi <krisman@xxxxxxxxxxxxx> writes:
> Introduce a mechanism to quickly disable/enable syscall handling for a
> specific process and redirect to userspace via SIGSYS. This is useful
> for processes with parts that require syscall redirection and parts that
> don't, but who need to perform this boundary crossing really fast,
> without paying the cost of a system call to reconfigure syscall handling
> on each boundary transition. This is particularly important for Windows
> games running over Wine.
>
> The proposed interface looks like this:
>
>   prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <start_addr>, <end_addr>, [selector])
>
> The range [<start_addr>,<end_addr>] is a part of the process memory map
> that is allowed to by-pass the redirection code and dispatch syscalls
> directly, such that in fast paths a process doesn't need to disable the
> trap nor the kernel has to check the selector. This is essential to
> return from SIGSYS to a blocked area without triggering another SIGSYS
> from rt_sigreturn.

Why isn't rt_sigreturn() exempt from that redirection in the first
place?

> ---
>  arch/Kconfig                          | 20 ++++++
>  arch/x86/Kconfig                      |  1 +
>  arch/x86/entry/common.c               |  5 ++
>  arch/x86/include/asm/thread_info.h    |  4 +-
>  arch/x86/kernel/signal_compat.c       |  2 +-
>  fs/exec.c                             |  2 +
>  include/linux/sched.h                 |  3 +
>  include/linux/syscall_user_dispatch.h | 50 +++++++++++++++
>  include/uapi/asm-generic/siginfo.h    |  3 +-
>  include/uapi/linux/prctl.h            |  5 ++
>  kernel/Makefile                       |  1 +
>  kernel/fork.c                         |  1 +
>  kernel/sys.c                          |  5 ++
>  kernel/syscall_user_dispatch.c        | 92 +++++++++++++++++++++++++++

A big combo patch is not how we do that. Please split it up into the
core part and a patch enabling it for a particular architecture.

As I said in my reply to Andy, this wants to go on top of the generic
entry/exit work stuff:

  https://lore.kernel.org/r/20200716182208.180916541@xxxxxxxxxxxxx

and then syscall_user_dispatch.c ends up in kernel/entry/ and the
dispatching function is not exposed outside of that directory.

I'm going to post a new version later today. Will cc you.

> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -93,6 +93,7 @@ struct thread_info {
>  #define TIF_NOTSC		16	/* TSC is not accessible in userland */
>  #define TIF_IA32		17	/* IA32 compatibility process */
>  #define TIF_SLD		18	/* Restore split lock detection on context switch */
> +#define TIF_SYSCALL_USER_DISPATCH 19	/* Redirect syscall for userspace handling */

There are two other things out there which compete about the last TIF
bits on x86, so we need to clean that up first.

> +static void trigger_sigsys(struct pt_regs *regs)
> +{
> +	struct kernel_siginfo info;
> +
> +	clear_siginfo(&info);
> +	info.si_signo = SIGSYS;
> +	info.si_code = SYS_USER_DISPATCH;
> +	info.si_call_addr = (void __user *)KSTK_EIP(current);
> +	info.si_errno = 0;
> +	info.si_arch = syscall_get_arch(current);
> +	info.si_syscall = syscall_get_nr(current, regs);
> +
> +	force_sig_info(&info);
> +}
> +
> +int do_syscall_user_dispatch(struct pt_regs *regs)
> +{
> +	struct syscall_user_dispatch *sd = &current->syscall_dispatch;
> +	unsigned long ip = instruction_pointer(regs);
> +	char state;
> +
> +	if (likely(ip >= sd->dispatcher_start && ip <= sd->dispatcher_end))
> +		return 0;
> +
> +	if (likely(sd->selector)) {
> +		if (unlikely(__get_user(state, sd->selector)))

__get_user() mandates an explicit access_ok() which happened in the
prctl(). So this wants a comment why there is none right here.

> +			do_exit(SIGSEGV);
> +
> +		if (likely(state == 0))
> +			return 0;
> +
> +		if (state != 1)
> +			do_exit(SIGSEGV);

If that happens it's going to be quite interesting to debug.
Also please use proper defines which are exposed to user space instead
of 0/1.

> +	}
> +
> +	syscall_rollback(current, regs);
> +	trigger_sigsys(regs);
> +
> +	return 1;
> +}
> +
> +int set_syscall_user_dispatch(int mode, unsigned long dispatcher_start,
> +			      unsigned long dispatcher_end, char __user *selector)
> +{
> +	switch (mode) {
> +	case PR_SYS_DISPATCH_OFF:
> +		if (dispatcher_start || dispatcher_end || selector)
> +			return -EINVAL;
> +		break;
> +	case PR_SYS_DISPATCH_ON:
> +		/*
> +		 * Validate the direct dispatcher region just for basic
> +		 * sanity. If the user is able to submit a syscall from
> +		 * an address, that address is obviously valid.
> +		 */
> +		if (dispatcher_end < dispatcher_start)
> +			return -EINVAL;
> +
> +		if (selector && !access_ok(selector, 1))

sizeof(*selector)

Thanks,

	tglx