Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Mon, 29 Apr 2019 19:16:11 -0700

On Mon, Apr 29, 2019 at 5:39 PM Jann Horn <jannh@xxxxxxxxxx> wrote:
>
> ... uuuh, whoops. Turns out I don't know what I'm talking about.

Well, apparently there's some odd libc issue accoprding to Florian, so
there *might* be something to it.

> Nevermind. For some reason I thought vfork() was just
> CLONE_VFORK|SIGCHLD, but now I see I got that completely wrong.

Well, inside the kernel, that's actually *very* close to what vfork() is:

  SYSCALL_DEFINE0(vfork)
  {
        return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
                        0, NULL, NULL, 0);
  }

but that's just an internal implementation detail. It's a real vfork()
and should act as the traditional BSD "share everything" without any
address space copying. The CLONE_VFORK flag is what does the "wait for
child to exit or execve" magic.

Note that vfork() is "exciting" for the compiler in much the same way
"setjmp/longjmp()" is, because of the shared stack use in the child
and the parent. It is *very* easy to get this wrong and cause massive
and subtle memory corruption issues because the parent returns to
something that has been messed up by the child.

That may be why some libc might end up just using "fork()", because it
ends up avoiding bugs in user space.
(In fact, if I recall correctly, the _reason_ we have an explicit
'vfork()' entry point rather than using clone() with magic parameters
was that the lack of arguments meant that you didn't have to
save/restore any registers in user space, which made the whole stack
issue simpler. But it's been two decades, so my memory is bitrotting).

Also, particularly if you have a big address space, vfork()+execve()
can be quite a bit faster than fork()+execve(). Linux fork() is pretty
efficient, but if you have gigabytes of VM space to copy, it's going
to take time even if you do it fairly well.

               Linus