Hello Christian,

On 4/27/20 4:36 PM, Christian Brauner wrote:
> For quite a while we have been thinking about using pidfds to attach to
> namespaces.

(Sounds promising.)

> This patchset has existed for about a year already but we've
> wanted to wait to see how the general api would be received and adopted.
> Now that more and more programs in userspace have started using pidfds
> for process management it's time to send this one out.
>
> This patch makes it possible to use pidfds to attach to the namespaces
> of another process, i.e. they can be passed as the first argument to the
> setns() syscall. When only a single namespace type is specified the
> semantics are equivalent to passing an nsfd. That means
> setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However,
> when a pidfd is passed, multiple namespace flags can be specified in the
> second setns() argument and setns() will attach the caller to all the
> specified namespaces all at once or to none of them.

While I think I understand what the intended semantics are, the
description in the previous paragraph feels off, so if this text
lands in a commit message (or a manual page), I think it needs fixing.

First, it seems odd to say that

    "setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET)"

setns(nsfd, CLONE_NEWNET) means: fail if nsfd does not refer to a
network namespace.

setns(pidfd, CLONE_NEWNET) means: move into just the network namespace
of the process referred to by 'pidfd'.

I would not call those two things "equal", in a semantic sense.

And then:

> If 0 is specified
> together with a pidfd then setns() will interpret it the same way 0 is
> interpreted together with a nsfd argument, i.e. attach to any/all
> namespaces.

If I understand right, setns(pidfd, 0) would mean: move into all of
the same namespaces as the process referred to by 'pidfd'.

But setns(nsfd, 0) means: move into whatever kind of namespace is
referred to by 'nsfd'.
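To make the contrast concrete, here is a rough sketch of the two call
forms. It is not code from the patch. The helper names join_by_nsfd() and
join_by_pidfd() are made up for illustration, the fallback syscall number
is an assumption, and the pidfd form assumes a kernel that carries this
patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434   /* assumption: number used on most architectures */
#endif

/*
 * nsfd form. 'ns_path' is a file such as /proc/<pid>/ns/net.
 * A nonzero 'nstype' acts as a type check: per setns(2), the call fails
 * with EINVAL if the fd refers to a different kind of namespace.
 * With nstype == 0, the caller joins whatever kind of namespace the fd
 * refers to.
 */
int join_by_nsfd(const char *ns_path, int nstype)
{
    assert(ns_path != NULL);

    int fd = open(ns_path, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -1;

    int ret = setns(fd, nstype);
    int saved_errno = errno;   /* keep setns()'s errno across close() */
    close(fd);
    errno = saved_errno;
    return ret;
}

/*
 * pidfd form (assumes a kernel with this patch applied). Here 'nstypes'
 * selects which of the target's namespaces to join, for example
 * CLONE_NEWNET | CLONE_NEWUTS. Per the cover letter, the caller is
 * attached to all of the selected namespaces or to none of them.
 */
int join_by_pidfd(pid_t pid, int nstypes)
{
    int pidfd = (int) syscall(SYS_pidfd_open, pid, 0);
    if (pidfd == -1)
        return -1;

    int ret = setns(pidfd, nstypes);
    int saved_errno = errno;
    close(pidfd);
    errno = saved_errno;
    return ret;
}
```

For instance, join_by_nsfd("/proc/self/ns/uts", CLONE_NEWNET) is expected
to fail, because the fd refers to a UTS namespace rather than a network
namespace, whereas join_by_pidfd(pid, CLONE_NEWNET | CLONE_NEWUTS) asks
for two of the target's namespaces in one call (and needs the usual
privileges to succeed).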
I would not say of these two cases that 0 is interpreted in the same
way. Hopefully I have not misunderstood.

> The obvious example where this is useful is a standard container
> manager interacting with a running container: pushing and pulling files
> or directories, injecting mounts, attaching/execing any kind of process,
> managing network devices all these operations require attaching to all
> or at least multiple namespaces at the same time. Given that nowadays
> most containers are spawned with all namespaces enabled we're currently
> looking at at least 14 syscalls, 7 to open the /proc/<pid>/ns/<ns>
> nsfds, another 7 to actually perform the namespace switch. With time
> namespaces we're looking at about 16 syscalls.
> (We could amortize the first 7 or 8 syscalls for opening the nsfds by
> stashing them in each container's monitor process but that would mean
> we need to send around those file descriptors through unix sockets
> everytime we want to interact with the container or keep on-disk
> state. Even in scenarios where a caller wants to join a particular
> namespace in a particular order callers still profit from batching
> other namespaces. That mostly applies to the user namespace but
> all container runtimes I found join the user namespace first no matter
> if it privileges or deprivileges the container.)
> With pidfds this becomes a single syscall no matter how many namespaces
> are supposed to be attached to.

That does seem like a win. Thanks for working on this!

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/