On Mon, Apr 29, 2019 at 10:50 PM Florian Weimer <fweimer@xxxxxxxxxx> wrote:
>
> * Jann Horn:
>
> >> int clone_temporary(int (*fn)(void *arg), void *arg, pid_t *child_pid,
> >> <clone flags and arguments, maybe in a struct>)
> >>
> >> and then you'd use it like this to fork off a child process:
> >>
> >> int spawn_shell_subprocess_(void *arg) {
> >>   char *cmdline = arg;
> >>   execl("/bin/sh", "sh", "-c", cmdline);
> >>   return -1;
> >> }
> >> pid_t spawn_shell_subprocess(char *cmdline) {
> >>   pid_t child_pid;
> >>   int res = clone_temporary(spawn_shell_subprocess_, cmdline,
> >> &child_pid, [...]);
> >>   if (res == 0) return child_pid;
> >>   return res;
> >> }
> >>
> >> clone_temporary() could be implemented roughly as follows by the libc
> >> (or other userspace code):
> >>
> >> sigset_t sigset, sigset_old;
> >> sigfillset(&sigset);
> >> sigprocmask(SIG_SETMASK, &sigset, &sigset_old);
> >> int child_pid;
> >> int result = 0;
> >> /* starting here, use inline assembly to ensure that no stack
> >> allocations occur */
> >> long child = syscall(__NR_clone,
> >> CLONE_VM|CLONE_CHILD_SETTID|CLONE_CHILD_CLEARTID|SIGCHLD, $RSP -
> >> ABI_STACK_REDZONE_SIZE, NULL, &child_pid, 0);
> >> if (child == -1) { result = -1; goto reset_sigmask; }
> >> if (child == 0) {
> >>   result = fn(arg);
> >>   syscall(__NR_exit, 0);
> >> }
> >> futex(&child_pid, FUTEX_WAIT, child, NULL);
> >> /* end of no-stack-allocations zone */
> >> reset_sigmask:
> >> sigprocmask(SIG_SETMASK, &sigset_old, NULL);
> >> return result;
> >
> > ... I guess that already has a name, and it's called vfork(). (Well,
> > except that the Linux vfork() isn't a real vfork().)
> >
> > So I guess my question is: Why not vfork()?
>
> Mainly because some users want access to the clone flags, and that's not
> possible with the current userspace wrappers. The stack setup for the
> undocumented clone wrapper is also cumbersome, and the ia64 pecularity
> annoying.
>
> For the stack sharing, the callback-based interface looks like the
> absolutely right thing to do to me. It enforces the notion that you can
> safely return on the child path from a function calling vfork.
>
> > And if vfork() alone isn't flexible enough, alternatively: How about
> > an API that forks a new child in the same address space, and then
> > allows the parent to invoke arbitrary syscalls in the context of the
> > child?
>
> As long it's not an eBPF script …

You shouldn't even joke about this (I'm serious.). I'm very certain
there are people who'd think this is a good idea.

>
> > You could also build that in userspace if you wanted, I think - just
> > let the child run an assembly loop that reads registers from a unix
> > seqpacket socket, invokes the syscall instruction, and writes the
> > value of the result register back into the seqpacket socket. As long
> > as you use CLONE_VM, you don't have to worry about moving the pointer
> > targets of syscalls. The user-visible API could look like this:
>
> People already use a variant of this, execve'ing twice. See
> jspawnhelper.
>
> Thanks,
> Florian
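
For reference, here is a minimal sketch of what a callback-based interface
with explicit clone flags can look like in userspace today, using the glibc
clone() wrapper rather than the raw syscall from the quoted pseudocode. The
helper name clone_vfork_call() and the 64 KiB stack size are assumptions made
up for illustration, not an existing API. Unlike the quoted proposal, the
child gets its own mmap()ed stack instead of sharing the parent's, and
CLONE_VFORK replaces the futex wait. The callback also passes the terminating
NULL that execl() requires.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumed stack size for the child; enough for a callback that just execs. */
#define CHILD_STACK_SIZE (64 * 1024)

/* Child callback: exec a shell. Only returns if execl() fails. */
static int spawn_shell_subprocess_(void *arg)
{
    char *cmdline = arg;
    execl("/bin/sh", "sh", "-c", cmdline, (char *)NULL);
    return 127; /* becomes the child's exit status */
}

/*
 * Hypothetical helper: run fn(arg) in a child that shares the parent's
 * address space (CLONE_VM) while the parent is suspended (CLONE_VFORK).
 * Returns 0 and stores the child's PID on success, -1 on failure.
 */
static int clone_vfork_call(int (*fn)(void *), void *arg, pid_t *child_pid)
{
    void *stack = mmap(NULL, CHILD_STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    /* Pass the top of the mapping, assuming a downward-growing stack. */
    pid_t pid = clone(fn, (char *)stack + CHILD_STACK_SIZE,
                      CLONE_VM | CLONE_VFORK | SIGCHLD, arg);
    int saved_errno = errno;

    /*
     * With CLONE_VFORK the parent only resumes once the child has exec'd
     * or exited, so the child no longer uses this stack.
     */
    munmap(stack, CHILD_STACK_SIZE);
    errno = saved_errno;

    if (pid == -1)
        return -1;
    *child_pid = pid;
    return 0;
}
```

A caller would do something like clone_vfork_call(spawn_shell_subprocess_,
cmdline, &pid) and later waitpid(pid, &status, 0). The sketch leaves out the
signal blocking shown in the quoted pseudocode; a real implementation would
need it, since a signal handler running in the child would otherwise operate
on the parent's memory.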