For quite a while we have been thinking about using pidfds to attach to namespaces. This patchset has existed for about a year already but we've wanted to wait to see how the general api would be received and adopted. Now that more and more programs in userspace have started using pidfds for process management it's time to send this one out. This patch makes it possible to use pidfds to attach to the namespaces of another process, i.e. they can be passed as the first argument to the setns() syscall. When only a single namespace type is specified the semantics are equivalent to passing an nsfd. That means setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However, when a pidfd is passed, multiple namespace flags can be specified in the second setns() argument and setns() will attach the caller to all the specified namespaces all at once or to none of them. If 0 is specified together with a pidfd then setns() will interpret it the same way 0 is interpreted together with a nsfd argument, i.e. attach to any/all namespaces. The obvious example where this is useful is a standard container manager interacting with a running container: pushing and pulling files or directories, injecting mounts, attaching/execing any kind of process, managing network devices all these operations require attaching to all or at least multiple namespaces at the same time. Given that nowadays most containers are spawned with all namespaces enabled we're currently looking at at least 14 syscalls, 7 to open the /proc/<pid>/ns/<ns> nsfds, another 7 to actually perform the namespace switch. With time namespaces we're looking at about 16 syscalls. (We could amortize the first 7 or 8 syscalls for opening the nsfds by stashing them in each container's monitor process but that would mean we need to send around those file descriptors through unix sockets everytime we want to interact with the container or keep on-disk state. Even in scenarios where a caller wants to join a particular namespace in a particular order callers still profit from batching other namespaces. That mostly applies to the user namespace but all container runtimes I found join the user namespace first no matter if it privileges or deprivileges the container.) With pidfds this becomes a single syscall no matter how many namespaces are supposed to be attached to. A decently designed, large-scale container manager usually isn't the parent of any of the containers it spawns so the containers don't die when it crashes or needs to update or reinitialize. This means that for the manger to interact with containers through pids is inherently racy especially on systems where the maximum pid number is not signficianly bumped. This is even more problematic since we often spawn and manage thousands or ten-thousands of containers. Interacting with a container through a pid thus can become risky quite quickly. Especially since we allow for an administrator to enable advanced features such as syscall interception where we're performing syscalls in lieu of the container. In all of those cases we use pidfds if they are available and we pass them around as stable references. Using them to setns() to the target process namespaces is as reliable as using nsfds. Either the target process is already dead and we get ESRCH or we manage to attach to its namespaces but we can't accidently attach to another process' namespaces. So pidfds lend themselves to be used with this api. Apart from significiantly reducing the number of syscalls from double digit to single digit which is a decent reason post-spectre/meltdown this also allows to switch to a set of namespaces atomically, i.e. either attaching to all the specified namespaces succeeds or we fail. If we fail we haven't changed a single namespace. There are currently three namespaces that can fail (other than for ENOMEM which really is not very interesting since we then have other problems anyway) for non-trivial reasons, user, mount, and pid namespaces. We can fail to attach to a pid namespace if it is not our current active pid namespace or a descendant of it. We can fail to attach to a user namespace because we are multi-threaded, because our current mount namespace shares filesystem state with other tasks, or because we're trying to setns() to the same user namespace, i.e. the target task has the same user namespace as we do. We can fail to attach to a mount namespace because it shares filesystem state with other tasks or because we fail to lookup the new root for the new mount namespace. In most non-pathological scenarios these issues can be somewhat mitigated. But there's e.g. still an inherent race between trying to setns() to the mount namespace of a task and that task spawning a child with CLONE_FS. If that process runs in a new user namespace we must have already setns()ed into the new user namespace otherwise we fail to attach to the mount namespace. There are other cases similar to that and we've had issues where we're half-attached to some namespace and failing in the middle. I've talked about some of these problem during the hallway track (something only the pre-COVID-19 generation will remember) of Plumber in Los Angeles in 2018(?). Even if all these issues could be avoided with super careful userspace coding it would be nicer to have this done in-kernel. There's not a lot of cost associated with this extension for the kernel and pidfds seem to lend themselves nicely for this. Cc: Eric W. Biederman <ebiederm@xxxxxxxxxxxx> Cc: Serge Hallyn <serge@xxxxxxxxxx> Cc: Aleksa Sarai <cyphar@xxxxxxxxxx> Signed-off-by: Christian Brauner <christian.brauner@xxxxxxxxxx> --- If we agree that this is useful than I'd pick this up for for v5.8. There's probably some smart trick around nsproxy and pidns life-cycle management that I'm missing but I tried to be conservative wrt to taking references. --- fs/namespace.c | 5 ++ fs/nsfs.c | 7 +- include/linux/mnt_namespace.h | 2 + include/linux/proc_fs.h | 6 ++ kernel/nsproxy.c | 132 +++++++++++++++++++++++++++++++--- 5 files changed, 143 insertions(+), 9 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index a28e4db075ed..1b120e134ea0 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -1733,6 +1733,11 @@ static struct mnt_namespace *to_mnt_ns(struct ns_common *ns) return container_of(ns, struct mnt_namespace, ns); } +struct ns_common *mnt_ns_to_common(struct mnt_namespace *mnt) +{ + return &mnt->ns; +} + static bool mnt_ns_loop(struct dentry *dentry) { /* Could bind mounting the mount namespace inode cause a diff --git a/fs/nsfs.c b/fs/nsfs.c index 4f1205725cfe..b023c1a367c8 100644 --- a/fs/nsfs.c +++ b/fs/nsfs.c @@ -229,6 +229,11 @@ int ns_get_name(char *buf, size_t size, struct task_struct *task, return res; } +bool proc_ns_file(const struct file *file) +{ + return file->f_op == &ns_file_operations; +} + struct file *proc_ns_fget(int fd) { struct file *file; @@ -237,7 +242,7 @@ struct file *proc_ns_fget(int fd) if (!file) return ERR_PTR(-EBADF); - if (file->f_op != &ns_file_operations) + if (!proc_ns_file(file)) goto out_invalid; return file; diff --git a/include/linux/mnt_namespace.h b/include/linux/mnt_namespace.h index 35942084cd40..664dd3b06f34 100644 --- a/include/linux/mnt_namespace.h +++ b/include/linux/mnt_namespace.h @@ -6,10 +6,12 @@ struct mnt_namespace; struct fs_struct; struct user_namespace; +struct ns_common; extern struct mnt_namespace *copy_mnt_ns(unsigned long, struct mnt_namespace *, struct user_namespace *, struct fs_struct *); extern void put_mnt_ns(struct mnt_namespace *ns); +extern struct ns_common *mnt_ns_to_common(struct mnt_namespace *); extern const struct file_operations proc_mounts_operations; extern const struct file_operations proc_mountinfo_operations; diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h index 45c05fd9c99d..acfd5012db4e 100644 --- a/include/linux/proc_fs.h +++ b/include/linux/proc_fs.h @@ -104,6 +104,7 @@ struct proc_dir_entry *proc_create_net_single_write(const char *name, umode_t mo proc_write_t write, void *data); extern struct pid *tgid_pidfd_to_pid(const struct file *file); +extern bool proc_ns_file(const struct file *file); #ifdef CONFIG_PROC_PID_ARCH_STATUS /* @@ -159,6 +160,11 @@ static inline struct pid *tgid_pidfd_to_pid(const struct file *file) return ERR_PTR(-EBADF); } +static inline bool proc_ns_file(const struct file *file) +{ + return false; +} + #endif /* CONFIG_PROC_FS */ struct net; diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c index ed9882108cd2..9bc211009a29 100644 --- a/kernel/nsproxy.c +++ b/kernel/nsproxy.c @@ -19,6 +19,7 @@ #include <net/net_namespace.h> #include <linux/ipc_namespace.h> #include <linux/time_namespace.h> +#include <linux/proc_fs.h> #include <linux/proc_ns.h> #include <linux/file.h> #include <linux/syscalls.h> @@ -257,21 +258,133 @@ void exit_task_namespaces(struct task_struct *p) switch_task_namespaces(p, NULL); } -SYSCALL_DEFINE2(setns, int, fd, int, nstype) +static int check_setns_flags(unsigned long flags) +{ + if (flags & ~(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWNET | + CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWCGROUP)) + return -EINVAL; + + return 0; +} + +static inline bool wants_ns(int flags, int ns) +{ + return !flags || (flags & ns) > 0; +} + +static inline int __ns_install(struct nsproxy *nsproxy, struct ns_common *ns) +{ + return ns->ops->install(nsproxy, ns); +} + +/* + * Ordering is equivalent to the standard ordering used everywhere + * else during unshare and process creation. + */ +static int ns_install(struct nsproxy *nsproxy, struct pid *pid, int flags) +{ + int ret = 0; + struct task_struct *tsk; + struct nsproxy *nsp; + + tsk = get_pid_task(pid, PIDTYPE_PID); + if (!tsk) + return -ESRCH; + + get_nsproxy(tsk->nsproxy); + nsp = tsk->nsproxy; + +#ifdef CONFIG_USER_NS + if (wants_ns(flags, CLONE_NEWUSER)) { + struct user_namespace *user_ns; + + user_ns = get_user_ns(__task_cred(tsk)->user_ns); + ret = __ns_install(nsproxy, &user_ns->ns); + put_user_ns(user_ns); + } +#else + if (flags & CLONE_NEWUSER) + ret = -EINVAL; +#endif + + if (!ret && wants_ns(flags, CLONE_NEWNS)) + ret = __ns_install(nsproxy, mnt_ns_to_common(nsp->mnt_ns)); + +#ifdef CONFIG_UTS_NS + if (!ret && wants_ns(flags, CLONE_NEWUTS)) + ret = __ns_install(nsproxy, &nsp->uts_ns->ns); +#else + if (flags & CLONE_NEWUTS) + ret = -EINVAL; +#endif + +#ifdef CONFIG_IPC_NS + if (!ret && wants_ns(flags, CLONE_NEWIPC)) + ret = __ns_install(nsproxy, &nsp->ipc_ns->ns); +#else + if (flags & CLONE_NEWIPC) + ret = -EINVAL; +#endif + +#ifdef CONFIG_PID_NS + if (!ret && wants_ns(flags, CLONE_NEWPID)) { + struct pid_namespace *pidns; + + pidns = task_active_pid_ns(tsk); + if (pidns) { + get_pid_ns(pidns); + ret = __ns_install(nsproxy, &pidns->ns); + put_pid_ns(pidns); + } + } +#else + if (flags & CLONE_NEWPID) + ret = EINVAL; +#endif + +#ifdef CONFIG_CGROUPS + if (!ret && wants_ns(flags, CLONE_NEWCGROUP)) + ret = __ns_install(nsproxy, &nsp->cgroup_ns->ns); +#else + if (flags & CLONE_NEWCGROUP) + ret = EINVAL; +#endif + +#ifdef CONFIG_NET_NS + if (!ret && wants_ns(flags, CLONE_NEWNET)) + ret = __ns_install(nsproxy, &nsp->net_ns->ns); +#else + if (flags & CLONE_NEWNET) + ret = -EINVAL; +#endif + + put_task_struct(tsk); + put_nsproxy(nsp); + + return ret; +} + +SYSCALL_DEFINE2(setns, int, fd, int, flags) { struct task_struct *tsk = current; struct nsproxy *new_nsproxy; struct file *file; - struct ns_common *ns; + struct ns_common *ns = NULL; int err; - file = proc_ns_fget(fd); - if (IS_ERR(file)) - return PTR_ERR(file); + file = fget(fd); + if (!file) + return -EBADF; err = -EINVAL; - ns = get_proc_ns(file_inode(file)); - if (nstype && (ns->ops->type != nstype)) + if (proc_ns_file(file)) { + ns = get_proc_ns(file_inode(file)); + if (!flags || (ns->ops->type == flags)) + err = 0; + } else if (pidfd_pid(file)) { + err = check_setns_flags(flags); + } + if (err) goto out; new_nsproxy = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs); @@ -280,7 +393,10 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype) goto out; } - err = ns->ops->install(new_nsproxy, ns); + if (proc_ns_file(file)) + err = ns->ops->install(new_nsproxy, ns); + else + err = ns_install(new_nsproxy, file->private_data, flags); if (err) { free_nsproxy(new_nsproxy); goto out; base-commit: ae83d0b416db002fe95601e7f97f64b59514d936 -- 2.26.2 _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/containers