On Mon, Apr 15, 2019 at 2:25 PM Oleg Nesterov <oleg@xxxxxxxxxx> wrote:
>
> On 04/15, Christian Brauner wrote:
> >
> > > CLONE_PARENT_SETTID doesn't look very useful, so what if we add
> > >
> > >	if ((clone_flags & (CLONE_PIDFD|CLONE_PARENT_SETTID)) ==
> > >	    (CLONE_PIDFD|CLONE_PARENT_SETTID))
> > >		return ERR_PTR(-EINVAL);
> > >
> > > at the start of copy_process() ?
> > >
> > > Then it can do
> > >
> > >	if (clone_flags & CLONE_PIDFD) {
> > >		retval = pidfd_create(pid, &pidfdf);
> > >		if (retval < 0)
> > >			goto bad_fork_free_pid;
> > >		retval = put_user(retval, parent_tidptr);
> > >		if (retval < 0)
> > >			goto bad_fork_free_pid;
> > >	}
> >
> > Uhhh Oleg, that is nifty. I have to say I like that a lot. This would
> > let us return the pid and the pidfd in one go and we can also start
> > pidfd numbering at 0.
>
> Christian, sorry if it was already discussed, but I can't force myself to
> read all the previous discussions ;)
>
> If we forget about CONFIG_PROC_FS, why do we really want to create a file?
>
>
> Suppose we add a global u64 counter incremented by copy_process and reported
> in /proc/$pid/status. Suppose that clone(CLONE_PIDFD) writes this counter to
> *parent_tidptr. Let's denote this counter as UNIQ_PID.
>
> Now, if you want to (say) safely kill a task and you have its UNIQ_PID, you
> can do
>
>	kill_by_pid_uniq(int pid, u64 uniq_pid)
>	{
>		pidfd = open("/proc/$pid", O_DIRECTORY);
>
>		status = openat(pidfd, "status");
>		u64 this_uniq_pid = ... read UNIQ_PID from status ...;
>
>		if (uniq_pid != this_uniq_pid)
>			return;
>
>		pidfd_send_signal(pidfd);
>	}
>
> Why else do we want pidfd?

Apart from what others have already pointed out, there are two other
things I am looking forward to:

* Currently, when ptracing from a thread, using waitpid means I have to
  either block or loop with pauses to receive the ptrace-related
  results, because ptrace is thread-directed and the related waitpid
  responses are delivered only to the attached thread. That makes it
  hard to also poll other event sources (e.g. to receive further
  commands over a pipe or message-passing fd). The waitfd patchset was
  rejected on the grounds that one could use a separate thread to do
  the waitpid while polling from the attached thread or a new thread,
  but because of ptrace this does not work. pidfds would allow for this
  (though it does mean we'd also need to be able to return one at
  ATTACH/SEIZE time). Note that waitid and other variants throw away a
  lot of needed information.

* Descriptors mean you can optionally bind your privileges to the file
  descriptor and then pass it around to others. They do not work this
  way now, but the option of such an extension has been kept open. One
  example is binding one's CAP_KILL capability to the descriptor and
  then passing it to another process, so that it can freely signal the
  said process (and only that one), or optionally poking holes in the
  restrictions imposed by PID namespaces (possibly in the future), etc.

>
> Oleg.
>
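
To make the "pass it around to others" part concrete: descriptors can already be handed between processes over an AF_UNIX socket using SCM_RIGHTS. The following is only a rough sketch of that existing transfer mechanism, not of the privilege binding discussed above (which does not exist today). The helper names send_fd() and recv_fd() are made up for illustration, and the sketch assumes a connected AF_UNIX socket between the two parties.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_cmsg_buf {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Hypothetical helper: send `fd` over the connected AF_UNIX socket
 * `sock`. Returns 0 on success, -1 on failure. */
int send_fd(int sock, int fd)
{
	char byte = 0;	/* one byte of ordinary data carries the control message */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg_buf u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	assert(cmsg != NULL);	/* buffer was sized for one header above */
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one descriptor from `sock`.
 * Returns the newly installed fd, or -1 on failure. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg_buf u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg == NULL ||
	    cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

If the descriptor sent this way were a pidfd, the receiver could in principle pass it to pidfd_send_signal(); whether that signal is permitted is still decided by the receiver's own credentials today, which is exactly the part the proposed extension would change.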