On Sun, Jul 12, 2020 at 12:45:15AM -0400, Gabriel Krisman Bertazi wrote:
> Introduce a mechanism to quickly disable/enable syscall handling for a
> specific process and redirect to userspace via SIGSYS. This is useful
> for processes with parts that require syscall redirection and parts that
> don't, but who need to perform this boundary crossing really fast,
> without paying the cost of a system call to reconfigure syscall handling
> on each boundary transition. This is particularly important for Windows
> games running over Wine.
>
> The proposed interface looks like this:
>
>   prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <start_addr>, <end_addr>, [selector])
>
> The range [<start_addr>,<end_addr>] is a part of the process memory map
> that is allowed to by-pass the redirection code and dispatch syscalls
> directly, such that in fast paths a process doesn't need to disable the
> trap nor the kernel has to check the selector. This is essential to
> return from SIGSYS to a blocked area without triggering another SIGSYS
> from rt_sigreturn.
>
> selector is an optional pointer to a char-sized userspace memory region
> that has a key switch for the mechanism. This key switch is set to
> either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the
> redirection without calling the kernel.
>
> The feature is meant to be set per-thread and it is disabled on
> fork/clone/execv.
>
> Internally, this doesn't add overhead to the syscall hot path, and it
> requires very little per-architecture support. I avoided using seccomp,
> even though it duplicates some functionality, due to previous feedback
> that maybe it shouldn't mix with seccomp since it is not a security
> mechanism. And obviously, this should never be considered a security
> mechanism, since any part of the program can by-pass it by using the
> syscall dispatcher.
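
The dispatch rule described so far (allowed range first, then the selector byte) can be modeled in plain userspace C. This is only a sketch of the semantics as stated in the description, not code from the patch: struct dispatch_model, model_redirects() and the MODEL_DISPATCH_* values are hypothetical names invented for illustration, and treating an absent selector as "always redirect outside the range" is an assumption the description does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-ins for PR_SYS_DISPATCH_OFF / PR_SYS_DISPATCH_ON.
 * The real values would come from the patched <linux/prctl.h>.
 */
enum { MODEL_DISPATCH_OFF = 0, MODEL_DISPATCH_ON = 1 };

/* Hypothetical per-thread configuration, mirroring the prctl() arguments. */
struct dispatch_model {
	uintptr_t start_addr;          /* first address allowed to dispatch directly */
	uintptr_t end_addr;            /* last address allowed to dispatch directly */
	const volatile char *selector; /* optional key switch, may be NULL */
};

/*
 * Return true when a syscall issued from instruction address 'ip' would be
 * redirected to userspace via SIGSYS, false when it would run natively.
 */
bool model_redirects(const struct dispatch_model *cfg, uintptr_t ip)
{
	assert(cfg->start_addr <= cfg->end_addr);

	/* Inside [start_addr, end_addr]: bypass, the selector is never read. */
	if (ip >= cfg->start_addr && ip <= cfg->end_addr)
		return false;

	/* Assumption: with no selector, everything outside the range traps. */
	if (cfg->selector == NULL)
		return true;

	/* Otherwise the key switch decides, with no kernel call to flip it. */
	return *cfg->selector == MODEL_DISPATCH_ON;
}
```

In this model a SIGSYS handler living inside the allowed range can issue rt_sigreturn without trapping again, and code outside the range flips between native and redirected syscalls by writing a single byte to the selector.
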
>
> For the sysinfo benchmark, which measures the overhead added to
> executing a native syscall that doesn't require interception, the
> overhead using only the direct dispatcher region to issue syscalls is
> pretty much irrelevant. The overhead of using the selector goes around
> 40ns for a native (unredirected) syscall in my system, and it is (as
> expected) dominated by the supervisor-mode user-address access. In
> fact, with SMAP off, the overhead is consistently less than 5ns on my
> test box.
>
> Right now, it is only supported by x86_64 and x86, but it should be
> easily enabled for other architectures.
>
> An example code using this interface can be found at:
>   https://gitlab.collabora.com/krisman/syscall-disable-personality
>
> Changes since v2:
>   (Matthew Wilcox suggestions)
>   - Drop __user on non-ptr type.
>   - Move #define closer to similar defs
>   - Allow a memory region that can dispatch directly
>   (Kees Cook suggestions)
>   - Improve kconfig summary line
>   - Move flag cleanup on execve to begin_new_exec
>   - Hint branch predictor in the syscall path
>   (Me)
>   - Convert selector to char
>
> Changes since RFC:
>   (Kees Cook suggestions)
>   - Don't mention personality while explaining the feature
>   - Use syscall_get_nr
>   - Remove header guard on several places
>   - Convert WARN_ON to WARN_ON_ONCE
>   - Explicit check for state values
>   - Rename to syscall user dispatcher
>
> Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> Cc: Andy Lutomirski <luto@xxxxxxxxxx>
> Cc: Paul Gofman <gofmanp@xxxxxxxxx>
> Cc: Kees Cook <keescook@xxxxxxxxxxxx>
> Signed-off-by: Gabriel Krisman Bertazi <krisman@xxxxxxxxxxxxx>

I think this looks great. :)

Reviewed-by: Kees Cook <keescook@xxxxxxxxxxxx>

Any other folks able to look through it?

-- 
Kees Cook