On Fri, Nov 27, 2020 at 02:32:34PM -0500, Gabriel Krisman Bertazi wrote: > Introduce a mechanism to quickly disable/enable syscall handling for a > specific process and redirect to userspace via SIGSYS. This is useful > for processes with parts that require syscall redirection and parts that > don't, but who need to perform this boundary crossing really fast, > without paying the cost of a system call to reconfigure syscall handling > on each boundary transition. This is particularly important for Windows > games running over Wine. > > The proposed interface looks like this: > > prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector]) > > The range [<offset>,<offset>+<length>) is a part of the process memory > map that is allowed to by-pass the redirection code and dispatch > syscalls directly, such that in fast paths a process doesn't need to > disable the trap nor the kernel has to check the selector. This is > essential to return from SIGSYS to a blocked area without triggering > another SIGSYS from rt_sigreturn. > > selector is an optional pointer to a char-sized userspace memory region > that has a key switch for the mechanism. This key switch is set to > either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the > redirection without calling the kernel. > > The feature is meant to be set per-thread and it is disabled on > fork/clone/execv. > > Internally, this doesn't add overhead to the syscall hot path, and it > requires very little per-architecture support. I avoided using seccomp, > even though it duplicates some functionality, due to previous feedback > that maybe it shouldn't mix with seccomp since it is not a security > mechanism. And obviously, this should never be considered a security > mechanism, since any part of the program can by-pass it by using the > syscall dispatcher. > > For the sysinfo benchmark, which measures the overhead added to > executing a native syscall that doesn't require interception, the > overhead using only the direct dispatcher region to issue syscalls is > pretty much irrelevant. The overhead of using the selector goes around > 40ns for a native (unredirected) syscall in my system, and it is (as > expected) dominated by the supervisor-mode user-address access. In > fact, with SMAP off, the overhead is consistently less than 5ns on my > test box. > > Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx> > Cc: Andy Lutomirski <luto@xxxxxxxxxx> > Cc: Paul Gofman <gofmanp@xxxxxxxxx> > Cc: Kees Cook <keescook@xxxxxxxxxxxx> > Cc: linux-api@xxxxxxxxxxxxxxx > Signed-off-by: Gabriel Krisman Bertazi <krisman@xxxxxxxxxxxxx> Acked-by: Kees Cook <keescook@xxxxxxxxxxxx> -- Kees Cook