[Bug 215769] man 2 vfork() does not document corner case when PID == 1

bugzilla-daemon@xxxxxxxxxx · Wed, 06 Apr 2022 08:46:20 +0000

https://bugzilla.kernel.org/show_bug.cgi?id=215769

--- Comment #10 from brauner@xxxxxxxxxx ---
On Tue, Apr 05, 2022 at 09:28:12PM +0200, Alejandro Colomar wrote:
> Hey, Christian!
> 
> On 4/4/22 10:05, Christian Brauner wrote:
> > On Sat, Apr 02, 2022 at 11:15:52PM +0200, Alejandro Colomar (man-pages) wrote:
> > > [Added some kernel CCs that may know what's going on]
> [...]
> > > Maybe someone in the kernel can send some patch for the clone(2) and/or
> > > vfork(2) manual pages that explains the reason (if it's intended).
> > 
> > Hey Alejandro,
> > 
> > I won't be able to send a patch very soon but I can at least explain why
> > you see EINVAL. :)
> 
> Don't hurry, we're not planning to release any time soon :)
> 
> > 
> > This is intended.
> > 
> > vfork() suspends the parent process and the child process will share the
> > same vm as the parent process. If the child process is in a new time
> > namespace different from its parent process it is not allowed to be in
> > the same threadgroup or share virtual memory with the parent process.
> > That's why you see EINVAL.
> 
> That makes a lot of sense to me.
> 
> > 
> > Note, the unshare(CLONE_NEWTIME) call will _not_ cause the calling
> > process to be moved into a different time namespace. Only the newly
> > created child process will be after a subsequent
> > fork()/vfork()/clone()/clone3()...
> > 
> > The semantics are equivalent to that of CLONE_NEWPID in this regard. You
> > can see this via /proc/<pid>/ns/ where you see two entries for pid
> > namespaces and also two entries for time namespaces:
> > 
> > * CLONE_NEWTIME
> >    * /proc/<pid>/ns/time                    // current time namespace
> >    * /proc/<pid>/ns/time_for_children       // time namespace for the new child process
> 
> Also makes sense.  Michael taught me that a few weeks ago :)
> 
> This also triggers some doubt:  will the same problem happen with
> CLONE_NEWPID since it also moves the child into a new ns (in this case a PID
> one)?  See test program below.

No, it won't. A pid namespace places no relevant constraints on vm usage
whereas a time namespace does.
If a task joins a new time namespace it'll clean the VVAR page tables
and refault them with the new layout after the timens change. That
affects all tasks which use the same task->mm.

Since CLONE_THREAD implies CLONE_VM this would affect the whole
thread-group behind their back. All threads would suddenly change
timens.

No such issues exist for pid namespaces; they don't need to alter
task->mm.

> 
> > 
> > If during fork:
> > 
> > parent_process->time != parent_process->time_for_children
> > 
> > and either CLONE_VM or CLONE_THREAD is set you see EINVAL.
> > 
> > You can thus replicate the same error via:
> > 
> > unshare(CLONE_NEWTIME)
> > 
> > and a
> > 
> > clone() or clone3() call with CLONE_VM or CLONE_THREAD.
> 
> So, to test my doubts, I wrote this similar program (and also similar
> programs where only the CLONE_NEW* flag was changed, one with CLONE_NEWTIME,
> and one with CLONE_NEWNS):
> 
> $ cat vfork_newpid.c
> #define _GNU_SOURCE
> #include <err.h>
> #include <errno.h>
> #include <linux/sched.h>
> #include <sched.h>
> #include <signal.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/syscall.h>
> #include <unistd.h>
> 
> static char *const child_argv[] = {
>       "print_pid",
>       NULL
> };
> 
> static char *const child_envp[] = {
>       NULL
> };
> 
> int
> main(void)
> {
>       pid_t pid;
> 
>       printf("%s: PID: %ld\n", program_invocation_short_name,
>              (long) getpid());
> 
>       if (unshare(CLONE_NEWPID) == -1)
>               err(EXIT_FAILURE, "unshare(2)");
>       if (signal(SIGCHLD, SIG_IGN) == SIG_ERR)
>               err(EXIT_FAILURE, "signal(2)");
> 
>       pid = syscall(SYS_vfork);
>       //pid = vfork();  // This behaves differently.
>       switch (pid) {
>       case 0:
>               execve("/home/alx/tmp/print_pid", child_argv, child_envp);
>               err(EXIT_SUCCESS, "PID %ld exiting after execve(2)",
>                   (long) getpid());
>       case -1:
>               err(EXIT_FAILURE, "vfork(2)");
>       default:
>               errx(EXIT_SUCCESS, "Parent exiting after vfork(2).");
>       }
> }
> 
> $ cat print_pid.c
> #include <err.h>
> #include <stdlib.h>
> #include <unistd.h>
> 
> int
> main(void)
> {
>       errx(EXIT_SUCCESS, "PID %ld exiting.", (long) getpid());
> }
> 
> $ cc -Wall -Wextra -Werror -o print_pid print_pid.c
> $ cc -Wall -Wextra -Werror -o vfork_newpid vfork_newpid.c
> $
> $
> $ sudo ./vfork_newpid
> vfork_newpid: PID: 8479
> vfork_newpid: PID 8479 exiting after execve(2): Success
> print_pid: PID 1 exiting.
> $
> $
> $ sudo ./vfork_newtime
> vfork_newtime: PID: 8484
> vfork_newtime: vfork(2): Invalid argument
> $
> $
> $ sudo ./vfork_newns
> vfork_newns: PID: 8486
> vfork_newns: PID 8486 exiting after execve(2): Success
> print_pid: PID 8487 exiting.
> 
> 
> The first thing I noted is that usage of vfork(2) differs considerably from
> fork(2), and that's something that's not clear by reading the manual page.
> It says that the parent process is suspended until the child calls
> execve(2); I took that to mean that vfork(2) doesn't return to the
> parent until that happens, but is otherwise transparent.  I was wrong, and
> my tests showed me that.
> 
> I was going to propose an example program for the manual page, when I
> decided to try a slightly different thing: call vfork() instead of
> syscall(SYS_vfork);  that changed the behavior to the same one as with
> fork(2) (i.e., the parent resumes after vfork(2) returns the PID of the
> child).
> 
> Is that also intended?  I couldn't find the glibc wrapper source code, so I
> don't know what glibc is doing here, but I straced the processes, and
> they're all calling vfork(), so the behavior should be consistent; it's
> quite weird.  I'm very confused at this point.

glibc does vfork() via inline assembly massaging. There are probably
atfork handlers and a bunch of other stuff involved so it's difficult to
do a remote diagnosis.
(And note that calling anything other than execve() or _exit() after
vfork() is basically undefined behavior.)

> 
> 
> I'm also wondering why it's okay to have processes in different PID ns share
> the same vm, but I guess those are implementation details that I don't need
> to care that much about.

See earlier in the thread.
