On Sat, Oct 6, 2018 at 10:56 PM Florian Weimer <fw@xxxxxxxxxxxxx> wrote:
>
> * Aleksa Sarai:
>
> > On 2018-10-01, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> >> >>> Currently most container runtimes try to do this resolution in
> >> >>> userspace[1], causing many potential race conditions. In addition, the
> >> >>> "obvious" alternative (actually performing a {ch,pivot_}root(2))
> >> >>> requires a fork+exec which is *very* costly if necessary for every
> >> >>> filesystem operation involving a container.
> >> >>
> >> >> Wait. fork() I understand, but why exec? And actually, you don't need
> >> >> a full fork() either, clone() lets you do this with some process parts
> >> >> shared. And then you also shouldn't need to use SCM_RIGHTS, just keep
> >> >> the file descriptor table shared. And why chroot()/pivot_root(),
> >> >> wouldn't you want to use setns()?
> >> >
> >> > You're right about this -- for C runtimes. In Go we cannot do a raw
> >> > clone() or fork() (if you do it manually with RawSyscall you'll end with
> >> > broken runtime state). So you're forced to do fork+exec (which then
> >> > means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes
> >> > for CLONE_VFORK.
> >>
> >> I must admit that I’m not very sympathetic to the argument that “Go’s
> >> runtime model is incompatible with the simpler solution.”
> >
> > Multi-threaded programs have a similar issue (though with Go it's much
> > worse). If you fork a multi-threaded C program then you can only safely
> > use AS-Safe glibc functions (those that are safe within a signal
> > handler). But if you're just doing three syscalls this shouldn't be as
> > big of a problem as Go where you can't even do said syscalls.
>
> The situation is a bit more complicated. There are many programs out
> there which use malloc and free (at least indirectly) after a fork,
> and we cannot break them.
> In glibc, we have a couple of subsystems
> which are put into a known state before calling the fork/clone system
> call if the application calls fork. The price we pay for that is a
> fork which is not POSIX-compliant because it is not async-signal-safe.
> Admittedly, other libcs chose different trade-offs.
>
> However, what is the same across libcs is this: You cannot call the
> clone system call directly and get a fully working new process. Some
> things break. For example, for recursive mutexes, we need to know the
> TID of the current thread, and we cannot perform a system call to get
> it for performance reasons. So everyone has a TID cache for that.
> But the TID cache does not get reset when you bypass the fork
> implementation in libc, so you end up with subtle corruption bugs on
> TID reuse.

Sure, but recursive mutexes etc. are a very specific use case. I'd even
go so far as to say that if you use mutexes + threads and then also fork
in those threads, you're hosed anyway. If you don't, things get a little
cleaner, assuming you don't call library functions that use mutexes
internally. Even then you might (sometimes at least) still get around
most problems with atfork handlers (though I really don't like them).
But you know more about this than I do. :)

>
> So I'd say that in most cases, the C situation is pretty much the same
> as the Go situation. If I recall correctly, the problem for Go is
> that it cannot call setns from Go code because it fails in the kernel
> for multi-threaded processes, and Go processes are already
> multi-threaded when user Go code runs.

That is true for *some* namespaces (user, mount) but not for all.
For example, setns(fd, CLONE_NEWNET) would be fine from Go. But the Go
runtime thinks it's clever to clone a new thread in between entry and
exit of a syscall. If you switch namespaces you might end up with a new
thread that belongs to the wrong namespace, which is very problematic.
So you can either rely on calling some Go magic that locks you to a
specific OS thread (runtime.LockOSThread()), but that only works
in later Go versions, or you go the constructor route: you implement
a (dummy) subcommand that you can call and that triggers the execution
of a C function marked with __attribute__((constructor)). That function
runs before the Go runtime, so you can do setns(), fork() and friends
(somewhat) safely in it. This has very bad performance and is a nasty
hack, but it's really unavoidable.
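
To make the constructor route a bit more concrete, here is a minimal
sketch of what such a C function could look like. It is an illustration
only, not code taken from any particular runtime: the environment
variable EXAMPLE_NSENTER_PATH, the helper join_namespace() and the flag
nsenter_ctor_ran are all made-up names, and the sketch assumes the file
is linked into the Go binary (e.g. via cgo) so that the constructor runs
while the process is still single-threaded.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical environment variable naming the namespace file to join,
 * e.g. /proc/<pid>/ns/net. Not part of any real tool's interface. */
#define NSENTER_PATH_ENV "EXAMPLE_NSENTER_PATH"

/* Set to 1 once the constructor has run (illustrative only). */
int nsenter_ctor_ran = 0;

/* Join the namespace that `path` refers to.
 * Returns 0 on success and -1 on error (errno is set by open/setns). */
int join_namespace(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    /* nstype 0 accepts whatever namespace type the fd refers to. */
    int ret = setns(fd, 0);
    close(fd);
    return ret;
}

/* Runs before main(). When linked into a Go binary, the assumption
 * (as described above) is that this is also before the Go runtime has
 * created any threads, so setns() sees a single-threaded process. */
__attribute__((constructor)) static void nsenter_ctor(void)
{
    nsenter_ctor_ran = 1;

    const char *path = getenv(NSENTER_PATH_ENV);
    if (path == NULL || *path == '\0')
        return; /* Normal start-up: do nothing and let the runtime boot. */

    if (join_namespace(path) < 0) {
        perror("setns");
        _exit(1);
    }
}
```

The parent process would then re-exec its own binary with the (dummy)
subcommand and the environment variable set, which is exactly where the
fork+exec cost comes from. When the variable is unset the constructor is
a no-op, so ordinary invocations of the binary are unaffected.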