Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor

Josh Triplett <josh@xxxxxxxxxxxxxxxx> · Fri, 13 Mar 2015 12:42:52 -0700

On Fri, Mar 13, 2015 at 04:05:29PM +0000, David Drysdale wrote:
> On Fri, Mar 13, 2015 at 1:40 AM, Josh Triplett <josh@xxxxxxxxxxxxxxxx> wrote:
> > This patch series introduces a new clone flag, CLONE_FD, which lets the caller
> > handle child process exit notification via a file descriptor rather than
> > SIGCHLD.  CLONE_FD makes it possible for libraries to safely launch and manage
> > child processes on behalf of their caller, *without* taking over process-wide
> > SIGCHLD handling (either via signal handler or signalfd).
> 
> Hi Josh,
> 
> From the overall description (i.e. I haven't looked at the code yet)
> this looks very interesting.  However, it seems to cover a lot of the
> same ground as the process descriptor feature that was added to FreeBSD
> in 9.x/10.x:
>   https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2

Interesting.

> I think it would ideally be nice for a userspace library developer to be
> able to do subprocess management (without SIGCHLD) in a similar way
> across both platforms, without lots of complicated autoconf shenanigans.
>
> So could we look at the overlap and seeing if we can come up with
> something that covers your requirements and also allows for something
> that looks like FreeBSD's process descriptors?

Agreed; however, I think it's reasonable to provide appropriate Linux
system calls, and then let glibc or libbsd or similar provide the
BSD-compatible calls on top of those.  I don't think the kernel
interface needs to exactly match FreeBSD's, as long as it's a superset
of the functionality.

For example, pdfork can just call clone4 with CLONE_FD and return the
resulting file descriptor.

In my further comments below, I'll suggest ways that the FreeBSD library
calls could be implemented on top of Linux system calls.

> (I've actually got some rough patches to add process descriptor
> functionality on Linux, so I can look at how the two approaches compare
> and contrast.)
> 
> > Note that signalfd for SIGCHLD does not suffice here, because that still
> > receives notification for all child processes, and interferes with process-wide
> > signal handling.
> >
> > The CLONE_FD file descriptor uniquely identifies a process on the system in a
> > race-free way, by holding a reference to the task_struct.  In the future, we
> > may introduce APIs that support using process file descriptors instead of PIDs.
> 
> FreeBSD has pdkill(2) and (theoretically) pdwait4(2) along these lines.
> I suspect we need either need pdkill(2) or a way to retrieve a PID from
> a process file descriptor, so that there's a way to send signals to the
> child.

The original caller of clone4 with CLONE_FD can pass CLONE_PARENT_SETTID
to get the PID.

In the future, I plan to add an fd-based equivalent of
rt_{,tg}sigqueueinfo (likely a single syscall with a flag to determine
whether to kill a process or thread) which is a superset of pdkill.
pdkill could then call that and just not pass the extra info.

A fair bit of pdwait4 could be implemented on top of read(), other than
the full rusage information (see below), and the ability to wait for
STOP/CONT (which the CLONE_FD file descriptor could support if desired,
but it'd have to be set via a flag at clone time).

I think it's a feature to use read() rather than an additional magic
system call.

> > Introducing CLONE_FD required two additional bits of yak shaving: Since clone
> > has no more usable flags (with the three currently unused flags unusable
> > because old kernels ignore them without EINVAL), also introduce a new clone4
> > system call with more flag bits and an extensible argument structure.  And
> > since the magic pt_regs-based syscall argument processing for clone's tls
> > argument would otherwise prevent introducing a sane clone4 system call, fix
> > that too.
> >
> > I tested the CLONE_SETTLS changes with a thread-local storage test program (two
> > threads independently reading and writing a __thread variable), on both 32-bit
> > and 64-bit, and I observed no issues there.
> 
> Worth preserving in tools/testing/selftests/ ?

Not really; it's just the following trivial program, which was faster to
write than to attempt to find somewhere:

#include <pthread.h>
#include <stdio.h>

__thread unsigned x = 0;

void *thread_func(void *unused)
{
    unsigned *tx = &x;
    for (; *tx < 10; (*tx)++)
        printf("child: tx=%p *tx=%u\n", tx, *tx);
    return NULL;
}

int main(void)
{
    unsigned *tx = &x;
    pthread_t thread;
    pthread_create(&thread, NULL, thread_func, NULL);
    for (; *tx < 10; (*tx)++)
        printf("main: tx=%p *tx=%u\n", tx, *tx);
    pthread_join(thread, NULL);
    return 0;
}

(I didn't bother with error handling, because I ran it under strace.)

> > I tested clone4 and the new CLONE_FD call with several additional test
> > programs, launching either a process or thread (in the former case using
> > syscall(), in the latter case by calling clone4 via assembly and returning to
> > C), sleeping in parent and child to test the case of either exiting first, and
> > then printing the received clone4_info structure.  Thiago also tested clone4
> > with CLONE_FD with a modified version of libqt's process handling, which
> > includes a test suite.
> >
> > I've also included the manpages patch at the end of this series.  (Note that
> > the manpage documents the behavior of the future glibc wrapper as well as the
> > raw syscall.)  Here's a formatted plain-text version of the manpage for
> > reference:
> 
> FYI, I've added some comparisons with the FreeBSD equivalents below.

Thanks!

> > CLONE4(2)                  Linux Programmer's Manual                 CLONE4(2)
> >
> >
> >
> > NAME
> >        clone4 - create a child process
> >
> > SYNOPSIS
> >        /* Prototype for the glibc wrapper function */
> >
> >        #define _GNU_SOURCE
> >        #include <sched.h>
> >
> >        int clone4(uint64_t flags,
> >                   size_t args_size,
> >                   struct clone4_args *args,
> >                   int (*fn)(void *), void *arg);
> >
> >        /* Prototype for the raw system call */
> >
> >        int clone4(unsigned flags_high, unsigned flags_low,
> >                   unsigned long args_size,
> >                   struct clone4_args *args);
> >
> >        struct clone4_args {
> >            pid_t *ptid;
> >            pid_t *ctid;
> >            unsigned long stack_start;
> >            unsigned long stack_size;
> >            unsigned long tls;
> >        };
> >
> >
> > DESCRIPTION
> >        clone4()  creates  a  new  process,  similar  to  clone(2) and fork(2).
> >        clone4() supports additional flags that clone(2) does not, and  accepts
> >        arguments via an extensible structure.
> >
> >        args  points to a clone4_args structure, and args_size must contain the
> >        size of that structure, as understood by the  caller.   If  the  caller
> >        passes  a  shorter  structure  than  the  kernel expects, the remaining
> >        fields will default to 0.  If the caller passes a larger structure than
> >        the  kernel  expects  (such  as one from a newer kernel), clone4() will
> >        return EINVAL.  The clone4_args structure may gain additional fields at
> >        the  end  in  the future, and callers must only pass a size that encom‐
> >        passes the number of fields they understand.  If the  caller  passes  0
> >        for args_size, args is ignored and may be NULL.
> >
> >        In  the clone4_args structure, ptid, ctid, stack_start, stack_size, and
> >        tls have the same semantics as they do with clone(2) and clone2(2).
> >
> >        In the glibc wrapper, fn and arg have the same  semantics  as  they  do
> >        with clone(2).  As with clone(2), the underlying system call works more
> >        like fork(2), returning 0 in the child process; the glibc wrapper  sim‐
> >        plifies  thread execution by calling fn(arg) and exiting the child when
> >        that function exits.
> >
> >        The 64-bit  flags  argument  (split  into  the  32-bit  flags_high  and
> >        flags_low arguments in the kernel interface) accepts all the same flags
> >        as  clone(2),  with  the   exception   of   the   obsolete   CLONE_PID,
> >        CLONE_DETACHED, and CLONE_STOPPED.  In addition, flags accepts the fol‐
> >        lowing flags:
> >
> >
> >        CLONE_FD
> >               Instead of returning a process ID, clone4()  with  the  CLONE_FD
> >               flag  returns a file descriptor associated with the new process.
> >               When the new process exits, the kernel will not send a signal to
> >               the  parent process, and will not keep the new process around as
> >               a "zombie" process  until  a  call  to  waitpid(2)  or  similar.
> >               Instead,  the file descriptor will become available for reading,
> >               and the new process will be immediately reaped.
> 
> Just to confirm: presumably a waitpid(-1,...) call that's already in
> progress won't return when one of these child processes exits?

I agree, I don't think it should.  Because otherwise you'd also assume
you can waitpid() on the PID itself, and that'd be a race condition
since the process autoreaps.

> >               Unlike using  signalfd(2)  for  the  SIGCHLD  signal,  the  file
> >               descriptor  returned  by  clone4()  with the CLONE_FD flag works
> >               even with SIGCHLD unblocked in one or more threads of the parent
> >               process,  and  allows the process to have different handlers for
> >               different child processes, such as those created by  a  library,
> >               without  introducing  race conditions around process-wide signal
> >               handling.
> >
> >               clone4() will never return a file descriptor in the range 0-2 to
> >               the caller, to avoid ambiguity with the return of 0 in the child
> >               process.  Only the  calling  process  will  have  the  new  file
> >               descriptor open; the child process will not.
> 
> FreeBSD's pdfork(2) returns a PID but also takes an int *fdp argument to
> return the file descriptor separately, which avoids the need for special
> case processing for low FD values (and means that POSIX's "lowest file
> descriptor not currently open" behaviour can be preserved if desired).

That'd be easy to implement if desired, by adding an outbound pointer to
clone4_args.

The (very mild) reason I'd dropped the PID: with CLONE_FD and future
syscalls that use the fd as an identifier, PIDs can hopefully become
mostly unnecessary.  However, I'm not that attached to changing the
return value; it'd be trivial to switch to an outbound parameter
instead, and then drop the "not 0-2".

> >               Since the kernel does not send a termination signal when a child
> >               process created with CLONE_FD exits, the low byte of flags  does
> >               not contain a signal number.  Instead, the low byte of flags can
> >               contain the following additional flags for use with CLONE_FD:
> >
> >
> >               CLONEFD_CLOEXEC
> >                      Set the O_CLOEXEC flag on the new open  file  descriptor.
> >                      See  the description of the O_CLOEXEC flag in open(2) for
> >                      reasons why this may be useful.
> >
> >
> >               CLONEFD_NONBLOCK
> >                      Set the O_NONBLOCK flag on the new open file  descriptor.
> >                      Using  this flag saves extra calls to fcntl(2) to achieve
> >                      the same result.
> >
> >
> >               clone4() with the CLONE_FD flag returns a file  descriptor  that
> >               supports the following operations:
> >
> >               read(2) (and similar)
> >                      When  the  new  process  exits,  reading  from  the  file
> >                      descriptor produces a single clonefd_info structure:
> >
> >                      struct clonefd_info {
> >                          uint32_t code;   /* Signal code */
> >                          uint32_t status; /* Exit status or signal */
> >                          uint64_t utime;  /* User CPU time */
> >                          uint64_t stime;  /* System CPU time */
> >                      };
> 
> Presumably there is no way to get full rusage information for the exited
> process?

I focused on the information available via SIGCHLD.  Even utime and
stime are unnecessary for the primary use case of CLONE_FD, but I
included them because SIGCHLD does.  I'd like to avoid sending the much
larger rusage over the file descriptor when the caller may not care.

However, given that the task_struct sticks around as long as the
CLONE_FD file descriptor does, if that information is normally still
available from a dead-but-not-waited-on process, it should be trivial to
add an operation that takes the file descriptor and returns the full
rusage, if someone needs that.  I think that can be done as part of a
later patch series adding other operations for use with the file
descriptor, though.

> [FreeBSD theoretically has pdwait4(2) to do wait4-like operations on a
> process descriptor, including rusage retrieval.  However, I don't think
> they actually implemented it:
>   http://fxr.watson.org/fxr/source/kern/syscalls.master#L928]

That's a pretty good argument that we don't need to either, at least not
yet.

> >                      If the new process has not  yet  exited,  read(2)  either
> >                      blocks  until  it does, or fails with the error EAGAIN if
> >                      the file descriptor has been made nonblocking.
> >
> >                      Future kernels may extend clonefd_info by appending addi‐
> >                      tional  fields  to  the end.  Callers should read as many
> >                      bytes as they understand; unread data will be  discarded,
> >                      and  subsequent  reads  after  the first will return 0 to
> >                      indicate end-of-file.  Callers requesting more bytes than
> >                      the  kernel  provides  (such as callers expecting a newer
> >                      clonefd_info structure) will receive a shorter  structure
> >                      from older kernels.
> 
> FreeBSD also implements fstat(2) for its process descriptors, although
> only a few of the fields get filled in.

I looked at what they provide, and that seems like more of a novelty
than something particularly useful (since most of the stat fields aren't
meaningful), but if that's useful for compatibility then adding it seems
fine.

> >               poll(2), select(2), epoll(7) (and similar)
> >                      The  file  descriptor  is readable (the select(2) readfds
> >                      argument; the poll(2) POLLIN flag) if the new process has
> >                      exited.
> 
> FreeBSD uses POLLHUP here.

That makes sense given that they provide the information via a separate
call rather than read.  Since the CLONE_FD file descriptor uses read, it
needs to provide POLLIN, but I have no objection to using *both* POLLIN
and POLLHUP if that'd be at all useful.

> >               close(2)
> >                      When  the file descriptor is no longer required it should
> >                      be closed.  If no process has a file descriptor open  for
> >                      the new process, no process will receive any notification
> >                      when the new process exits.  The new process  will  still
> >                      be immediately reaped.
> 
> FreeBSD has two different behaviours for close(2), depending on a flag
> value (PD_DAEMON).  With the flag set it's roughly like this, but
> without PD_DAEMON a close(2) operation on the (last open) file
> descriptor terminates the child process.
> 
> This can be quite useful, particularly for the use case where some
> userspace library has an FD-controlled subprocess -- if the application
> using the library terminates, the process descriptor is closed and so
> the subprocess is automatically terminated.

That's an interesting idea.  I don't think it makes sense for that to be
the default behavior, but if someone wanted to add an additional flag
to implement that behavior, that seems fine.  A FreeBSD-compatible
pdfork could then use that flag when not passed PD_DAEMON and not use it
when passed PD_DAEMON.

How does it kill the process when the last open descriptor closes?
SIGKILL?  SIGTERM?  The former seems unfriendly (preventing graceful
termination), and the latter blockable.  There's a reason init systems
send TERM, then wait, then KILL.

- Josh Triplett
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html