On Fri, Mar 13, 2015 at 04:05:29PM +0000, David Drysdale wrote: > On Fri, Mar 13, 2015 at 1:40 AM, Josh Triplett <josh@xxxxxxxxxxxxxxxx> wrote: > > This patch series introduces a new clone flag, CLONE_FD, which lets the caller > > handle child process exit notification via a file descriptor rather than > > SIGCHLD. CLONE_FD makes it possible for libraries to safely launch and manage > > child processes on behalf of their caller, *without* taking over process-wide > > SIGCHLD handling (either via signal handler or signalfd). > > Hi Josh, > > From the overall description (i.e. I haven't looked at the code yet) > this looks very interesting. However, it seems to cover a lot of the > same ground as the process descriptor feature that was added to FreeBSD > in 9.x/10.x: > https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2 Interesting. > I think it would ideally be nice for a userspace library developer to be > able to do subprocess management (without SIGCHLD) in a similar way > across both platforms, without lots of complicated autoconf shenanigans. > > So could we look at the overlap and seeing if we can come up with > something that covers your requirements and also allows for something > that looks like FreeBSD's process descriptors? Agreed; however, I think it's reasonable to provide appropriate Linux system calls, and then let glibc or libbsd or similar provide the BSD-compatible calls on top of those. I don't think the kernel interface needs to exactly match FreeBSD's, as long as it's a superset of the functionality. For example, pdfork can just call clone4 with CLONE_FD and return the resulting file descriptor. In my further comments below, I'll suggest ways that the FreeBSD library calls could be implemented on top of Linux system calls. > (I've actually got some rough patches to add process descriptor > functionality on Linux, so I can look at how the two approaches compare > and contrast.) > > > Note that signalfd for SIGCHLD does not suffice here, because that still > > receives notification for all child processes, and interferes with process-wide > > signal handling. > > > > The CLONE_FD file descriptor uniquely identifies a process on the system in a > > race-free way, by holding a reference to the task_struct. In the future, we > > may introduce APIs that support using process file descriptors instead of PIDs. > > FreeBSD has pdkill(2) and (theoretically) pdwait4(2) along these lines. > I suspect we need either need pdkill(2) or a way to retrieve a PID from > a process file descriptor, so that there's a way to send signals to the > child. The original caller of clone4 with CLONE_FD can pass CLONE_PARENT_SETTID to get the PID. In the future, I plan to add an fd-based equivalent of rt_{,tg}sigqueueinfo (likely a single syscall with a flag to determine whether to kill a process or thread) which is a superset of pdkill. pdkill could then call that and just not pass the extra info. A fair bit of pdwait4 could be implemented on top of read(), other than the full rusage information (see below), and the ability to wait for STOP/CONT (which the CLONE_FD file descriptor could support if desired, but it'd have to be set via a flag at clone time). I think it's a feature to use read() rather than an additional magic system call. > > Introducing CLONE_FD required two additional bits of yak shaving: Since clone > > has no more usable flags (with the three currently unused flags unusable > > because old kernels ignore them without EINVAL), also introduce a new clone4 > > system call with more flag bits and an extensible argument structure. And > > since the magic pt_regs-based syscall argument processing for clone's tls > > argument would otherwise prevent introducing a sane clone4 system call, fix > > that too. > > > > I tested the CLONE_SETTLS changes with a thread-local storage test program (two > > threads independently reading and writing a __thread variable), on both 32-bit > > and 64-bit, and I observed no issues there. > > Worth preserving in tools/testing/selftests/ ? Not really; it's just the following trivial program, which was faster to write than to attempt to find somewhere: #include <pthread.h> #include <stdio.h> __thread unsigned x = 0; void *thread_func(void *unused) { unsigned *tx = &x; for (; *tx < 10; (*tx)++) printf("child: tx=%p *tx=%u\n", tx, *tx); return NULL; } int main(void) { unsigned *tx = &x; pthread_t thread; pthread_create(&thread, NULL, thread_func, NULL); for (; *tx < 10; (*tx)++) printf("main: tx=%p *tx=%u\n", tx, *tx); pthread_join(thread, NULL); return 0; } (I didn't bother with error handling, because I ran it under strace.) > > I tested clone4 and the new CLONE_FD call with several additional test > > programs, launching either a process or thread (in the former case using > > syscall(), in the latter case by calling clone4 via assembly and returning to > > C), sleeping in parent and child to test the case of either exiting first, and > > then printing the received clone4_info structure. Thiago also tested clone4 > > with CLONE_FD with a modified version of libqt's process handling, which > > includes a test suite. > > > > I've also included the manpages patch at the end of this series. (Note that > > the manpage documents the behavior of the future glibc wrapper as well as the > > raw syscall.) Here's a formatted plain-text version of the manpage for > > reference: > > FYI, I've added some comparisons with the FreeBSD equivalents below. Thanks! > > CLONE4(2) Linux Programmer's Manual CLONE4(2) > > > > > > > > NAME > > clone4 - create a child process > > > > SYNOPSIS > > /* Prototype for the glibc wrapper function */ > > > > #define _GNU_SOURCE > > #include <sched.h> > > > > int clone4(uint64_t flags, > > size_t args_size, > > struct clone4_args *args, > > int (*fn)(void *), void *arg); > > > > /* Prototype for the raw system call */ > > > > int clone4(unsigned flags_high, unsigned flags_low, > > unsigned long args_size, > > struct clone4_args *args); > > > > struct clone4_args { > > pid_t *ptid; > > pid_t *ctid; > > unsigned long stack_start; > > unsigned long stack_size; > > unsigned long tls; > > }; > > > > > > DESCRIPTION > > clone4() creates a new process, similar to clone(2) and fork(2). > > clone4() supports additional flags that clone(2) does not, and accepts > > arguments via an extensible structure. > > > > args points to a clone4_args structure, and args_size must contain the > > size of that structure, as understood by the caller. If the caller > > passes a shorter structure than the kernel expects, the remaining > > fields will default to 0. If the caller passes a larger structure than > > the kernel expects (such as one from a newer kernel), clone4() will > > return EINVAL. The clone4_args structure may gain additional fields at > > the end in the future, and callers must only pass a size that encom‐ > > passes the number of fields they understand. If the caller passes 0 > > for args_size, args is ignored and may be NULL. > > > > In the clone4_args structure, ptid, ctid, stack_start, stack_size, and > > tls have the same semantics as they do with clone(2) and clone2(2). > > > > In the glibc wrapper, fn and arg have the same semantics as they do > > with clone(2). As with clone(2), the underlying system call works more > > like fork(2), returning 0 in the child process; the glibc wrapper sim‐ > > plifies thread execution by calling fn(arg) and exiting the child when > > that function exits. > > > > The 64-bit flags argument (split into the 32-bit flags_high and > > flags_low arguments in the kernel interface) accepts all the same flags > > as clone(2), with the exception of the obsolete CLONE_PID, > > CLONE_DETACHED, and CLONE_STOPPED. In addition, flags accepts the fol‐ > > lowing flags: > > > > > > CLONE_FD > > Instead of returning a process ID, clone4() with the CLONE_FD > > flag returns a file descriptor associated with the new process. > > When the new process exits, the kernel will not send a signal to > > the parent process, and will not keep the new process around as > > a "zombie" process until a call to waitpid(2) or similar. > > Instead, the file descriptor will become available for reading, > > and the new process will be immediately reaped. > > Just to confirm: presumably a waitpid(-1,...) call that's already in > progress won't return when one of these child processes exits? I agree, I don't think it should. Because otherwise you'd also assume you can waitpid() on the PID itself, and that'd be a race condition since the process autoreaps. > > Unlike using signalfd(2) for the SIGCHLD signal, the file > > descriptor returned by clone4() with the CLONE_FD flag works > > even with SIGCHLD unblocked in one or more threads of the parent > > process, and allows the process to have different handlers for > > different child processes, such as those created by a library, > > without introducing race conditions around process-wide signal > > handling. > > > > clone4() will never return a file descriptor in the range 0-2 to > > the caller, to avoid ambiguity with the return of 0 in the child > > process. Only the calling process will have the new file > > descriptor open; the child process will not. > > FreeBSD's pdfork(2) returns a PID but also takes an int *fdp argument to > return the file descriptor separately, which avoids the need for special > case processing for low FD values (and means that POSIX's "lowest file > descriptor not currently open" behaviour can be preserved if desired). That'd be easy to implement if desired, by adding an outbound pointer to clone4_args. The (very mild) reason I'd dropped the PID: with CLONE_FD and future syscalls that use the fd as an identifier, PIDs can hopefully become mostly unnecessary. However, I'm not that attached to changing the return value; it'd be trivial to switch to an outbound parameter instead, and then drop the "not 0-2". > > Since the kernel does not send a termination signal when a child > > process created with CLONE_FD exits, the low byte of flags does > > not contain a signal number. Instead, the low byte of flags can > > contain the following additional flags for use with CLONE_FD: > > > > > > CLONEFD_CLOEXEC > > Set the O_CLOEXEC flag on the new open file descriptor. > > See the description of the O_CLOEXEC flag in open(2) for > > reasons why this may be useful. > > > > > > CLONEFD_NONBLOCK > > Set the O_NONBLOCK flag on the new open file descriptor. > > Using this flag saves extra calls to fcntl(2) to achieve > > the same result. > > > > > > clone4() with the CLONE_FD flag returns a file descriptor that > > supports the following operations: > > > > read(2) (and similar) > > When the new process exits, reading from the file > > descriptor produces a single clonefd_info structure: > > > > struct clonefd_info { > > uint32_t code; /* Signal code */ > > uint32_t status; /* Exit status or signal */ > > uint64_t utime; /* User CPU time */ > > uint64_t stime; /* System CPU time */ > > }; > > Presumably there is no way to get full rusage information for the exited > process? I focused on the information available via SIGCHLD. Even utime and stime are unnecessary for the primary use case of CLONE_FD, but I included them because SIGCHLD does. I'd like to avoid sending the much larger rusage over the file descriptor when the caller may not care. However, given that the task_struct sticks around as long as the CLONE_FD file descriptor does, if that information is normally still available from a dead-but-not-waited-on process, it should be trivial to add an operation that takes the file descriptor and returns the full rusage, if someone needs that. I think that can be done as part of a later patch series adding other operations for use with the file descriptor, though. > [FreeBSD theoretically has pdwait4(2) to do wait4-like operations on a > process descriptor, including rusage retrieval. However, I don't think > they actually implemented it: > http://fxr.watson.org/fxr/source/kern/syscalls.master#L928] That's a pretty good argument that we don't need to either, at least not yet. > > If the new process has not yet exited, read(2) either > > blocks until it does, or fails with the error EAGAIN if > > the file descriptor has been made nonblocking. > > > > Future kernels may extend clonefd_info by appending addi‐ > > tional fields to the end. Callers should read as many > > bytes as they understand; unread data will be discarded, > > and subsequent reads after the first will return 0 to > > indicate end-of-file. Callers requesting more bytes than > > the kernel provides (such as callers expecting a newer > > clonefd_info structure) will receive a shorter structure > > from older kernels. > > FreeBSD also implements fstat(2) for its process descriptors, although > only a few of the fields get filled in. I looked at what they provide, and that seems like more of a novelty than something particularly useful (since most of the stat fields aren't meaningful), but if that's useful for compatibility then adding it seems fine. > > poll(2), select(2), epoll(7) (and similar) > > The file descriptor is readable (the select(2) readfds > > argument; the poll(2) POLLIN flag) if the new process has > > exited. > > FreeBSD uses POLLHUP here. That makes sense given that they provide the information via a separate call rather than read. Since the CLONE_FD file descriptor uses read, it needs to provide POLLIN, but I have no objection to using *both* POLLIN and POLLHUP if that'd be at all useful. > > close(2) > > When the file descriptor is no longer required it should > > be closed. If no process has a file descriptor open for > > the new process, no process will receive any notification > > when the new process exits. The new process will still > > be immediately reaped. > > FreeBSD has two different behaviours for close(2), depending on a flag > value (PD_DAEMON). With the flag set it's roughly like this, but > without PD_DAEMON a close(2) operation on the (last open) file > descriptor terminates the child process. > > This can be quite useful, particularly for the use case where some > userspace library has an FD-controlled subprocess -- if the application > using the library terminates, the process descriptor is closed and so > the subprocess is automatically terminated. That's an interesting idea. I don't think it makes sense for that to be the default behavior, but if someone wanted to add an additional flag to implement that behavior, that seems fine. A FreeBSD-compatible pdfork could then use that flag when not passed PD_DAEMON and not use it when passed PD_DAEMON. How does it kill the process when the last open descriptor closes? SIGKILL? SIGTERM? The former seems unfriendly (preventing graceful termination), and the latter blockable. There's a reason init systems send TERM, then wait, then KILL. - Josh Triplett -- To unsubscribe from this list: send the line "unsubscribe linux-api" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html