On Sun, Mar 15, 2015 at 12:59 AM, Josh Triplett <josh@xxxxxxxxxxxxxxxx> wrote:
> This patch series introduces a new clone flag, CLONE_FD, which lets the caller
> receive child process exit notification via a file descriptor rather than
> SIGCHLD. CLONE_FD makes it possible for libraries to safely launch and manage
> child processes on behalf of their caller, *without* taking over process-wide
> SIGCHLD handling (either via signal handler or signalfd).
>
> Note that signalfd for SIGCHLD does not suffice here, because that still
> receives notification for all child processes, and interferes with process-wide
> signal handling.
>
> The CLONE_FD file descriptor uniquely identifies a process on the system in a
> race-free way, by holding a reference to the task_struct. In the future, we
> may introduce APIs that support using process file descriptors instead of PIDs.
>
> This patch series also introduces a clone flag CLONE_AUTOREAP, which causes the
> kernel to automatically reap the child process when it exits, just as it does
> for processes using SIGCHLD when the parent has SIGCHLD ignored or marked as
> SA_NOCLDWAIT.
>
> Taken together, a library can launch a process with CLONE_FD, CLONE_AUTOREAP,
> and no exit signal, and completely avoid affecting either process-wide signal
> handling or an existing child wait loop.
>
> Introducing CLONE_FD and CLONE_AUTOREAP required two additional bits of yak
> shaving: Since clone has no more usable flags (with the three currently unused
> flags unusable because old kernels ignore them without EINVAL), also introduce
> a new clone4 system call with more flag bits and an extensible argument
> structure. And since the magic pt_regs-based syscall argument processing for
> clone's tls argument would otherwise prevent introducing a sane clone4 system
> call, fix that too.
>
> I tested the CLONE_SETTLS changes with a thread-local storage test program (two
> threads independently reading and writing a __thread variable), on both 32-bit
> and 64-bit, and I observed no issues there.
>
> I tested clone4 and the new flags with several additional test programs,
> launching either a process or thread (in the former case using syscall(), in
> the latter case by calling clone4 via assembly and returning to C), sleeping in
> parent and child to test the case of either exiting first, and then printing
> the received clone4_info structure.
>
> Changes in v2:
> - Split out autoreaping into a separate CLONE_AUTOREAP. CLONE_FD no longer
>   implies autoreaping and no exit signal, and CLONE_AUTOREAP does not affect
>   ptracers or signal handling. Thanks to Oleg Nesterov for careful
>   investigation and discussion on v1.
> - Accept O_CLOEXEC and O_NONBLOCK via a clonefd_flags parameter in clone4_args.
>   Stop overloading the low byte of the main clone flags, since CLONE_FD now
>   works with a non-zero signal.
> - Return the file descriptor via an out parameter in clone4_args.
> - Drop patch to export alloc_fd; CLONE_FD now uses the next available file
>   descriptor, even if that's 0-2, since clone4 no longer needs to avoid
>   ambiguity with the 0 return indicating the child process.
> - Make poll on a CLONE_FD for an exited task also return POLLHUP, for
>   compatibility with FreeBSD's pdfork. Thanks to David Drysdale for calling
>   attention to pdfork.

I think POLLHUP should be mentioned in the manpage (now it only mentions
POLLIN).

> - Fix typo in squelch_clone_flags.
> - Pass arguments to _do_fork and copy_process as a structure.
> - Construct the 64-bit flags in a separate variable, rather than inline in the
>   call to do_fork.
> - Fix error return for copy_from_user faults.
> - Add the new syscall to asm-generic.
> - Add ack from Andy Lutomirski to patches 1 and 2.
>
> I've included the manpages patch at the end of this series.
> (Note that the manpage documents the behavior of the future glibc wrapper as
> well as the raw syscall.) Here's a formatted plain-text version of the manpage
> for reference:
>
> CLONE4(2)                  Linux Programmer's Manual                 CLONE4(2)
>
>
>
> NAME
>        clone4 - create a child process
>
> SYNOPSIS
>        /* Prototype for the glibc wrapper function */
>
>        #define _GNU_SOURCE
>        #include <sched.h>
>
>        int clone4(uint64_t flags,
>                   size_t args_size,
>                   struct clone4_args *args,
>                   int (*fn)(void *), void *arg);
>
>        /* Prototype for the raw system call */
>
>        int clone4(unsigned flags_high, unsigned flags_low,
>                   unsigned long args_size,
>                   struct clone4_args *args);
>
>        struct clone4_args {
>            pid_t *ptid;
>            pid_t *ctid;
>            unsigned long stack_start;
>            unsigned long stack_size;
>            unsigned long tls;
>            int *clonefd;
>            unsigned clonefd_flags;
>        };
>
>
> DESCRIPTION
>        clone4() creates a new process, similar to clone(2) and fork(2).
>        clone4() supports additional flags that clone(2) does not, and accepts
>        arguments via an extensible structure.
>
>        args points to a clone4_args structure, and args_size must contain the
>        size of that structure, as understood by the caller. If the caller
>        passes a shorter structure than the kernel expects, the remaining
>        fields will default to 0. If the caller passes a larger structure than
>        the kernel expects (such as one from a newer kernel), clone4() will
>        return EINVAL. The clone4_args structure may gain additional fields at
>        the end in the future, and callers must only pass a size that
>        encompasses the number of fields they understand. If the caller passes
>        0 for args_size, args is ignored and may be NULL.
>
>        In the clone4_args structure, ptid, ctid, stack_start, stack_size, and
>        tls have the same semantics as they do with clone(2) and clone2(2).
>
>        In the glibc wrapper, fn and arg have the same semantics as they do
>        with clone(2).
>        As with clone(2), the underlying system call works more like fork(2),
>        returning 0 in the child process; the glibc wrapper simplifies thread
>        execution by calling fn(arg) and exiting the child when that function
>        exits.
>
>        The 64-bit flags argument (split into the 32-bit flags_high and
>        flags_low arguments in the kernel interface for portability across
>        architectures) accepts all the same flags as clone(2), with the
>        exception of the obsolete CLONE_PID, CLONE_DETACHED, and CLONE_STOPPED.
>        In addition, flags accepts the following flags:
>
>
>        CLONE_AUTOREAP
>               When the new process exits, immediately reap it, rather than
>               keeping it around as a "zombie" until a call to waitpid(2) or
>               similar. Without this flag, the kernel will automatically reap
>               a process if its exit signal is set to SIGCHLD, and if the
>               parent process has SIGCHLD set to SIG_IGN or has a SIGCHLD
>               handler installed with SA_NOCLDWAIT (see sigaction(2)).
>               CLONE_AUTOREAP allows the calling process to enable automatic
>               reaping with an exit signal other than SIGCHLD (including 0 to
>               disable the exit signal), and does not depend on the
>               configuration of process-wide signal handling.
>
>
>        CLONE_FD
>               Return a file descriptor associated with the new process,
>               storing it in location clonefd in the parent's address space.
>               When the new process exits, the file descriptor will become
>               available for reading.
>
>               Unlike using signalfd(2) for the SIGCHLD signal, the file
>               descriptor returned by clone4() with the CLONE_FD flag works
>               even with SIGCHLD unblocked in one or more threads of the parent
>               process, allowing the process to have different handlers for
>               different child processes, such as those created by a library,
>               without introducing race conditions around process-wide signal
>               handling.
>
>               clonefd_flags may contain the following additional flags for use
>               with CLONE_FD:
>
>
>               O_CLOEXEC
>                      Set the close-on-exec flag on the new file descriptor.
>                      See the description of the O_CLOEXEC flag in open(2) for
>                      reasons why this may be useful.

This raises the question: what happens when all CLONE_FD fds for a process are
closed? Will the parent get SIGCHLD instead, will it auto-reap, or will it be
un-wait-able (I assume not this...)?

>
>
>               O_NONBLOCK
>                      Set the O_NONBLOCK flag on the new file descriptor.
>                      Using this flag saves extra calls to fcntl(2) to achieve
>                      the same result.
>
>
>               The returned file descriptor supports the following operations:
>
>               read(2) (and similar)
>                      When the new process exits, reading from the file
>                      descriptor produces a single clonefd_info structure:
>
>                          struct clonefd_info {
>                              uint32_t code;   /* Signal code */
>                              uint32_t status; /* Exit status or signal */
>                              uint64_t utime;  /* User CPU time */
>                              uint64_t stime;  /* System CPU time */
>                          };
>
>
>                      If the new process has not yet exited, read(2) either
>                      blocks until it does, or fails with the error EAGAIN if
>                      the file descriptor has O_NONBLOCK set.
>
>                      Future kernels may extend clonefd_info by appending
>                      additional fields to the end. Callers should read as many
>                      bytes as they understand; unread data will be discarded,
>                      and subsequent reads after the first will return 0 to
>                      indicate end-of-file. Callers requesting more bytes than
>                      the kernel provides (such as callers expecting a newer
>                      clonefd_info structure) will receive a shorter structure
>                      from older kernels.
>
>               poll(2), select(2), epoll(7) (and similar)
>                      The file descriptor is readable (the select(2) readfds
>                      argument; the poll(2) POLLIN flag) if the new process has
>                      exited.
>
>               close(2)
>                      When the file descriptor is no longer required it should
>                      be closed.
>
>
>    C library/kernel ABI differences
>        As with clone(2), the raw clone4() system call corresponds more closely
>        to fork(2) in that execution in the child continues from the point of
>        the call.
>
>        Unlike clone(2), the raw system call interface for clone4() accepts
>        arguments in the same order on all architectures.
>
>        The raw system call accepts flags as two 32-bit arguments, flags_high
>        and flags_low, to simplify portability across 32-bit and 64-bit
>        architectures and calling conventions. The glibc wrapper accepts flags
>        as a single 64-bit argument for convenience.
>
>
> RETURN VALUE
>        For the glibc wrapper, on success, clone4() returns the new process ID
>        to the calling process, and the new process begins running at the
>        specified function.
>
>        For the raw syscall, on success, clone4() returns the new process ID to
>        the calling process, and returns 0 in the new process.
>
>        On failure, clone4() returns -1 and sets errno accordingly.
>
>
> ERRORS
>        clone4() can return any error from clone(2), as well as the following
>        additional errors:
>
>        EFAULT args is outside your accessible address space.
>
>        EINVAL flags contained an unknown flag.
>
>        EINVAL flags included CLONE_FD and clonefd_flags contained an unknown
>               flag.
>
>        EINVAL flags included CLONE_FD, but the kernel configuration does not
>               have the CONFIG_CLONEFD option enabled.
>
>        EMFILE flags included CLONE_FD, but the new file descriptor would
>               exceed the process limit on open file descriptors.
>
>        ENFILE flags included CLONE_FD, but the new file descriptor would
>               exceed the system-wide limit on open file descriptors.
>
>        ENODEV flags included CLONE_FD, but clone4() could not mount the
>               (internal) anonymous inode device.
>
>
> CONFORMING TO
>        clone4() is Linux-specific and should not be used in programs intended
>        to be portable.
>
>
> SEE ALSO
>        clone(2), epoll(7), poll(2), pthreads(7), read(2), select(2)
>
>
>
> Linux                             2015-03-14                         CLONE4(2)
>
>
> Josh Triplett and Thiago Macieira (7):
>   clone: Support passing tls argument via C rather than pt_regs magic
>   x86: Opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
>   Introduce a new clone4 syscall with more flag bits and extensible arguments
>   kernel/fork.c: Pass arguments to _do_fork and copy_process using clone4_args
>   clone4: Add a CLONE_AUTOREAP flag to automatically reap the child process
>   signal: Factor out a helper function to process task_struct exit_code
>   clone4: Add a CLONE_FD flag to get task exit notification via fd
>
>  arch/Kconfig                      |   7 ++
>  arch/x86/Kconfig                  |   1 +
>  arch/x86/ia32/ia32entry.S         |   3 +-
>  arch/x86/kernel/entry_64.S        |   1 +
>  arch/x86/kernel/process_32.c      |   6 +-
>  arch/x86/kernel/process_64.c      |   8 +--
>  arch/x86/syscalls/syscall_32.tbl  |   1 +
>  arch/x86/syscalls/syscall_64.tbl  |   2 +
>  include/linux/compat.h            |  14 ++++
>  include/linux/sched.h             |  22 ++++++
>  include/linux/syscalls.h          |   6 +-
>  include/uapi/asm-generic/unistd.h |   4 +-
>  include/uapi/linux/sched.h        |  55 ++++++++++++++-
>  init/Kconfig                      |  21 ++++++
>  kernel/Makefile                   |   1 +
>  kernel/clonefd.c                  | 121 ++++++++++++++++++++++++++++++++
>  kernel/clonefd.h                  |  32 +++++++++
>  kernel/exit.c                     |   4 ++
>  kernel/fork.c                     | 142 ++++++++++++++++++++++++++++++--------
>  kernel/signal.c                   |  26 ++++---
>  kernel/sys_ni.c                   |   1 +
>  21 files changed, 426 insertions(+), 52 deletions(-)
>  create mode 100644 kernel/clonefd.c
>  create mode 100644 kernel/clonefd.h
>
> --
> 2.1.4
>

Looks promising!

-Kees

--
Kees Cook
Chrome OS Security