https://bugzilla.kernel.org/show_bug.cgi?id=215769

--- Comment #10 from brauner@xxxxxxxxxx ---
On Tue, Apr 05, 2022 at 09:28:12PM +0200, Alejandro Colomar wrote:
> Hey, Christian!
>
> On 4/4/22 10:05, Christian Brauner wrote:
> > On Sat, Apr 02, 2022 at 11:15:52PM +0200, Alejandro Colomar (man-pages) wrote:
> > > [Added some kernel CCs that may know what's going on]
> [...]
> > > Maybe someone in the kernel can send some patch for the clone(2) and/or
> > > vfork(2) manual pages that explains the reason (if it's intended).
> >
> > Hey Alejandro,
> >
> > I won't be able to send a patch very soon but I can at least explain why
> > you see EINVAL. :)
>
> Don't hurry, we're not planning to release anytime soon :)
>
> >
> > This is intended.
> >
> > vfork() suspends the parent process and the child process will share the
> > same vm as the parent process. If the child process is in a new time
> > namespace different from its parent process it is not allowed to be in
> > the same threadgroup or share virtual memory with the parent process.
> > That's why you see EINVAL.
>
> That makes a lot of sense to me.
>
> >
> > Note, the unshare(CLONE_NEWTIME) call will _not_ cause the calling
> > process to be moved into a different time namespace. Only the newly
> > created child process will be after a subsequent
> > fork()/vfork()/clone()/clone3()...
> >
> > The semantics are equivalent to that of CLONE_NEWPID in this regard. You
> > can see this via /proc/<pid>/ns/ where you see two entries for pid
> > namespaces and also two entries for time namespaces:
> >
> > * CLONE_NEWTIME
> >   * /proc/<pid>/ns/time // current time namespace
> >   * /proc/<pid>/ns/time_for_children // time namespace for the new child process
>
> Also makes sense.  Michael taught me that a few weeks ago :)
>
> This also triggers some doubt: will the same problem happen with
> CLONE_NEWPID since it also moves the child into a new ns (in this case a PID
> one)?  See test program below.

No, it won't.
A pid namespace places no relevant constraints on vm usage whereas a
time namespace does. If a task joins a new time namespace it'll clean
the VVAR page tables and refault them with the new layout after the
timens change. That affects all tasks which use the same task->mm.
Since CLONE_THREAD implies CLONE_VM this would affect the whole
thread-group behind their back. All threads would suddenly change
timens. No such issues exist for pid namespaces; they don't need to
alter task->mm.

>
> >
> > If during fork:
> >
> >     parent_process->time != parent_process->time_for_children
> >
> > and either CLONE_VM or CLONE_THREAD is set you see EINVAL.
> >
> > You can thus replicate the same error via:
> >
> >     unshare(CLONE_NEWTIME)
> >
> > and a
> >
> >     clone() or clone3() call with CLONE_VM or CLONE_THREAD.
>
> So, to test my doubts, I wrote this similar program (and also similar
> programs where only the CLONE_NEW* flag was changed, one with CLONE_NEWTIME,
> and one with CLONE_NEWNS):
>
> $ cat vfork_newpid.c
> #define _GNU_SOURCE
> #include <err.h>
> #include <errno.h>
> #include <linux/sched.h>
> #include <sched.h>
> #include <signal.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/syscall.h>
> #include <unistd.h>
>
> static char *const child_argv[] = {
>         "print_pid",
>         NULL
> };
>
> static char *const child_envp[] = {
>         NULL
> };
>
> int
> main(void)
> {
>         pid_t pid;
>
>         printf("%s: PID: %ld\n", program_invocation_short_name, (long) getpid());
>
>         if (unshare(CLONE_NEWPID) == -1)
>                 err(EXIT_FAILURE, "unshare(2)");
>         if (signal(SIGCHLD, SIG_IGN) == SIG_ERR)
>                 err(EXIT_FAILURE, "signal(2)");
>
>         pid = syscall(SYS_vfork);
>         //pid = vfork();  // This behaves differently.
>         switch (pid) {
>         case 0:
>                 execve("/home/alx/tmp/print_pid", child_argv, child_envp);
>                 err(EXIT_SUCCESS, "PID %ld exiting after execve(2)",
>                     (long) getpid());
>         case -1:
>                 err(EXIT_FAILURE, "vfork(2)");
>         default:
>                 errx(EXIT_SUCCESS, "Parent exiting after vfork(2).");
>         }
> }
>
> $ cat print_pid.c
> #include <err.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> int
> main(void)
> {
>         errx(EXIT_SUCCESS, "PID %ld exiting.", (long) getpid());
> }
>
> $ cc -Wall -Wextra -Werror -o print_pid print_pid.c
> $ cc -Wall -Wextra -Werror -o vfork_newpid vfork_newpid.c
> $
> $
> $ sudo ./vfork_newpid
> vfork_newpid: PID: 8479
> vfork_newpid: PID 8479 exiting after execve(2): Success
> print_pid: PID 1 exiting.
> $
> $
> $ sudo ./vfork_newtime
> vfork_newtime: PID: 8484
> vfork_newtime: vfork(2): Invalid argument
> $
> $
> $ sudo ./vfork_newns
> vfork_newns: PID: 8486
> vfork_newns: PID 8486 exiting after execve(2): Success
> print_pid: PID 8487 exiting.
>
>
> The first thing I noted is that usage of vfork(2) differs considerably from
> fork(2), and that's something that's not clear by reading the manual page.
> It says that the parent process is suspended until the child calls
> execve(2), but I expected it to mean that vfork(2) doesn't return to the
> parent until that happened, but was otherwise transparent.  I was wrong and
> my tests showed me that.
>
> I was going to propose an example program for the manual page, when I
> decided to try a slightly different thing: call vfork() instead of
> syscall(SYS_vfork); that changed the behavior to the same one as with
> fork(2) (i.e., the parent resumes after vfork(2) returns the PID of the
> child).
>
> Is that also intended?  I couldn't find the glibc wrapper source code, so I
> don't know what glibc is doing here, but I straced the processes, and
> they're all calling vfork(), so the behavior should be consistent; it's
> quite weird.  I'm very confused at this point.

glibc does vfork() via inline assembly massaging.
There are probably atfork handlers and a bunch of other stuff involved,
so it's difficult to do a remote diagnosis. (And note that calling
anything other than execve() or _exit() in the child after vfork() is
basically undefined behavior.)

>
>
> I'm also wondering why it's okay to have processes in different PID ns share
> the same vm, but I guess that's implementation details that I don't need to
> care that much.

See earlier in the thread.
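
For anyone who wants to poke at the rule from userspace, here is a minimal
sketch in C. It is not kernel code. ns_differs() and timens_clone_rejected()
are hypothetical helper names made up for this illustration. The first
compares two /proc/<pid>/ns/ entries by device and inode number, which is
the comparison namespaces(7) describes. The second only models the condition
described above (time != time_for_children together with CLONE_VM or
CLONE_THREAD), under the assumption that this is the whole check.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: returns 1 if the two paths refer to different
 * objects, 0 if they refer to the same one, and -1 if either path cannot
 * be stat()ed (e.g. a kernel without time namespaces, or no /proc).
 */
int ns_differs(const char *a, const char *b)
{
        struct stat sa, sb;

        assert(a != NULL && b != NULL);

        /* stat() follows the /proc/<pid>/ns/ link to the namespace inode. */
        if (stat(a, &sa) == -1 || stat(b, &sb) == -1)
                return -1;
        return sa.st_dev != sb.st_dev || sa.st_ino != sb.st_ino;
}

/*
 * Userspace model (an assumption, not the kernel's code) of the rule
 * from this thread: if a time namespace change is pending for children
 * and the clone flags include CLONE_VM or CLONE_THREAD, expect EINVAL.
 */
bool timens_clone_rejected(unsigned long flags, bool timens_pending)
{
        return timens_pending && (flags & (CLONE_VM | CLONE_THREAD)) != 0;
}
```

Calling ns_differs("/proc/self/ns/time", "/proc/self/ns/time_for_children")
before and after unshare(CLONE_NEWTIME) should show the switch from "same"
to "different" for the calling process. Passing that result as
timens_pending, along with the flags vfork() implies
(CLONE_VFORK | CLONE_VM), models why the vfork_newtime test above fails
while a plain fork() would not.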