Re: [PATCH 1/2] fs/exec: allow to unshare a time namespace on vfork+exec

Florian Weimer <fweimer@xxxxxxxxxx> · Wed, 15 Jun 2022 10:14:19 +0200

* Christian Brauner:

> For pid namespaces one problem would be that it could end up confusing a
> process about its own pid. This was a more serious problem when the pid
> cache was still active in glibc; but fwiw systemd still has a pid cache
> afair.

Right.  glibc still has a TID cache, mainly for use with recursive
mutexes (where we need a 32-bit thread identifier and can't perform a
system call on every locking operation for performance reasons).
Assuming that a non-delayed CLONE_NEWPID would also change the TID
underneath us, we'd have subtly broken recursive mutexes.
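
As a rough illustration of why the cache matters (a sketch only, not
glibc's actual implementation; `cached_tid`, `my_tid` and
`struct rec_mutex` are made-up names), a recursive lock compares the
stored owner against the caller's cached TID on every operation.  If the
kernel-visible TID changed without the cache being refreshed, the owner
recorded in the lock would no longer match what the kernel reports:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical per-thread TID cache.  glibc keeps its own copy in its
   internal thread descriptor; this variable only stands in for it. */
static _Thread_local pid_t cached_tid;

static pid_t my_tid(void)
{
    if (cached_tid == 0)
        cached_tid = (pid_t)syscall(SYS_gettid); /* one system call per thread */
    return cached_tid;
}

struct rec_mutex {
    atomic_int owner;   /* TID of the owning thread, 0 when unlocked */
    unsigned count;     /* recursion depth, only touched by the owner */
};

/* Returns 0 on success, -1 if another thread holds the lock. */
static int rec_mutex_trylock(struct rec_mutex *m)
{
    assert(m != NULL);
    int self = my_tid();            /* no system call on the fast path */

    if (atomic_load(&m->owner) == self) {
        m->count++;                 /* recursive acquisition */
        return 0;
    }
    int expected = 0;
    if (atomic_compare_exchange_strong(&m->owner, &expected, self)) {
        m->count = 1;
        return 0;
    }
    return -1;
}

/* Returns 0 on success, -1 if the caller does not own the lock. */
static int rec_mutex_unlock(struct rec_mutex *m)
{
    assert(m != NULL);
    if (atomic_load(&m->owner) != my_tid() || m->count == 0)
        return -1;
    if (--m->count == 0)
        atomic_store(&m->owner, 0);
    return 0;
}
```

The ownership test is a plain integer comparison against the cached
value, which is why a stale cache would go unnoticed by the lock itself.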

vfork gets away with not updating the TID cache (which is shared with
the parent process) because the parent process is suspended while the
new subprocess is still running and has not execve'ed yet.
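
For reference, the usual vfork pattern looks roughly like the sketch
below (`spawn_and_wait` is a hypothetical helper, not something from the
patch).  The child only calls execve or _exit, so it never consults the
cached TID it shares with the suspended parent:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `path` without extra arguments and return its
   exit status, 127 if execve failed, or -1 on any other error. */
static int spawn_and_wait(const char *path)
{
    assert(path != NULL);
    char *const argv[] = { (char *)path, NULL };
    char *const envp[] = { NULL };

    pid_t child = vfork();
    if (child < 0)
        return -1;

    if (child == 0) {
        /* The child borrows the parent's address space while the parent
           is suspended, so it does nothing except execve or _exit. */
        execve(path, argv, envp);
        _exit(127);
    }

    /* The parent resumes only after the child has exec'ed or exited. */
    int status;
    if (waitpid(child, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```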

Now one could argue that calling unshare automatically means that you
must not call any glibc functions afterwards (similar to a
thread-creating clone call), or at least that you cannot call any
functions which are not
async-signal-safe, but that does not match existing application
practice.  And I think we actually prefer that file servers call chroot
after unshare(CLONE_FS), rather than trying to reimplement restricted
pathname lookup in userspace.
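
The file server pattern I have in mind is roughly the following sketch
(`confine_thread_to` is a made-up name, and the caller is assumed to have
CAP_SYS_CHROOT).  Each worker thread gives itself private filesystem
attributes and then changes its root, after which it keeps calling
ordinary libc functions:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: confine the calling thread's pathname lookups to
   `dir`.  Returns 0 on success, -1 with errno set on failure. */
static int confine_thread_to(const char *dir)
{
    assert(dir != NULL);

    /* Stop sharing root directory, working directory and umask with the
       other threads, so the chroot below affects only this thread. */
    if (unshare(CLONE_FS) < 0)
        return -1;

    /* Needs CAP_SYS_CHROOT; fails with EPERM otherwise. */
    if (chroot(dir) < 0)
        return -1;

    /* Move the working directory inside the new root. */
    if (chdir("/") < 0)
        return -1;

    return 0;
}
```

After this returns 0, the thread would typically go on to use open,
stat, and similar calls, most of which are not restricted to the
async-signal-safe subset.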

Thanks,
Florian