> On May 11, 2019, at 10:21 AM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
>> On Sat, May 11, 2019 at 1:00 PM Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>
>> A better “spawn” API should fix this.
>
> Andy, stop with the "spawn would be better".

It doesn’t have to be spawn per se. But the current situation sucks.

> Notice? None of the real problems are about execve or would be solved
> by any spawn API. You just think that because you've apparently been
> talking to too many MS people that think fork (and thus indirectly
> execve()) is bad process management.

I’ve literally never spoken to an MS person about it. What container managers and init systems *want* is a way to drop privileges, change namespaces, etc., and then run something in a controlled way so that the intermediate states aren’t dangerous. An API for this could be spawn-like or exec-like; that particular distinction is beside the point.

Having personally written code that mucks with namespaces, I've wanted two particular abilities that are both quite awkward:

a) Change all my UIDs and GIDs to match a container, enter that container's namespaces, and run some binary in the container's filesystem, all atomically enough that I don't need to worry about accidentally leaking privileges into the container. A super-duper-non-dumpable mode would kind of allow this, but I'd worry that there's some other hole besides ptrace() and /proc/self.

b) Change all my UIDs and GIDs to match a container, enter that container's namespaces, and run some binary that is *not* in the container's filesystem. This happens, for example, if the container's mount namespace has no exec mounts at all. We don't have a fantastic way to do this at all right now due to /proc/self/exe.

Regardless, the actual CVE at hand would have been nicely avoided if writing to /proc/self/exe didn’t work, and I see no reason we can’t make that happen.
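
To make the "intermediate states" in (a) concrete, here is a rough sketch of how that sequence has to be strung together today, one syscall at a time. The helper name `enter_and_exec` and its parameters are made up for illustration, and `setns()` is called through ctypes on the assumption that the C library exports it. This is not how any particular container manager does it.

```python
import ctypes
import os

# CDLL(None) exposes the C library already loaded into this process.
_libc = ctypes.CDLL(None, use_errno=True)


def enter_and_exec(ns_fds, uid, gid, argv):
    """Join namespaces, switch IDs, then exec (hypothetical helper).

    Every step is a separate syscall, so none of this is atomic.
    """
    for fd in ns_fds:
        # setns(fd, 0): join whatever namespace type the fd refers to.
        if _libc.setns(fd, 0) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))

    # Intermediate state 1: already inside the container's namespaces,
    # but still holding the original credentials.
    os.setgroups([])   # drop supplementary groups first
    os.setgid(gid)     # group before user: after setuid() we may lack the privilege
    os.setuid(uid)

    # Intermediate state 2: container IDs, but still the original binary,
    # with whatever memory and open fds it had.
    os.execv(argv[0], argv)
```

Both commented "intermediate state" points are exactly the windows that an atomic spawn-like or exec-like API would close.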
I suppose we could also consider a change to disable /proc/self/exe if it's not reachable from /proc/self/root. By "disable", I mean that readlink() should maybe still work, but actually trying to open it could probably fail safely.
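
As a userspace illustration of the readlink()-versus-open() distinction, something like the following shows the two behaviors side by side. It is only an approximation of the idea: the helper name `self_exe_status` is made up, and comparing device and inode numbers stands in for whatever reachability check the kernel would actually do.

```python
import os


def self_exe_status(root="/proc/self/root"):
    """Report what readlink() and open() say about /proc/self/exe.

    Hypothetical helper: a userspace approximation, not the kernel check.
    """
    # readlink() only returns text; it grants no access to the file.
    target = os.readlink("/proc/self/exe")

    # open() follows the magic link to the real file, wherever it lives.
    fd = os.open("/proc/self/exe", os.O_RDONLY)
    try:
        real = os.fstat(fd)
    finally:
        os.close(fd)

    try:
        # Resolve the readlink() text underneath our current root and
        # check that it names the same file we just opened.
        seen = os.stat(root + target)
        reachable = (seen.st_dev, seen.st_ino) == (real.st_dev, real.st_ino)
    except OSError:
        # For example a deleted binary, or a path that does not exist
        # under this root.
        reachable = False

    return {"target": target, "reachable": reachable}
```

In the unreachable case the open() above still succeeds today; the suggestion is that it could fail instead while readlink() keeps working.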