On Mon, Apr 27, 2020 at 4:47 PM Christian Brauner <christian.brauner@xxxxxxxxxx> wrote: > For quite a while we have been thinking about using pidfds to attach to > namespaces. This patchset has existed for about a year already but we've > wanted to wait to see how the general api would be received and adopted. > Now that more and more programs in userspace have started using pidfds > for process management it's time to send this one out. You can already reliably switch to a specific namespace of another process like this given a pidfd and the pid of the process (which, if you don't have it, you can get via fdinfo), right? int switch_ns_to(int pidfd, int pid, char *nstypename) { char ns_path[100]; snprintf(ns_path, sizeof(ns_path), "/proc/%d/ns/%s", pid, nstypename); int fd = open(ns_path, O_RDONLY|O_CLOEXEC); int errno_after_open = errno; if (pidfd_send_signal(pidfd, 0, NULL, 0)) return -1; if (fd == -1) { errno = errno_after_open; return -1; } int ret = setns(fd, 0); close(fd); return ret; } > This patch makes it possible to use pidfds to attach to the namespaces > of another process, i.e. they can be passed as the first argument to the > setns() syscall. When only a single namespace type is specified the > semantics are equivalent to passing an nsfd. This introduces a difference in terms of security: With the old API, you need PTRACE_MODE_READ_FSCREDS on the task whose namespace you're attaching to (to dereference the link /proc/*/ns/*) *AND* whatever access checks the namespace itself enforces (always includes a check for CAP_SYS_ADMIN on the namespace). The ptrace check has the advantage, among other things, that it allows an LSM to see the relationship between the task that's accessing the namespace (subject) and the task whose namespace is being accessed (object). I feel slightly twitchy about this relaxation, and I'm wondering whether we can add a ptrace access check analogous to what you'd have needed via procfs. > That means > setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However, > when a pidfd is passed, multiple namespace flags can be specified in the > second setns() argument and setns() will attach the caller to all the > specified namespaces all at once or to none of them. If 0 is specified > together with a pidfd then setns() will interpret it the same way 0 is > interpreted together with a nsfd argument, i.e. attach to any/all > namespaces. [...] > Apart from significiantly reducing the number of syscalls from double > digit to single digit which is a decent reason post-spectre/meltdown > this also allows to switch to a set of namespaces atomically, i.e. > either attaching to all the specified namespaces succeeds or we fail. Apart from the issues I've pointed out below, I think it's worth calling out explicitly that with the current design, the switch will not, in fact, be fully atomic - the process will temporarily be in intermediate stages where the switches to some namespaces have completed while the switches to other namespaces are still pending; and while there will be less of these intermediate stages than before, it also means that they will be less explicit to userspace. [...] > diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c [...] > +/* > + * Ordering is equivalent to the standard ordering used everywhere > + * else during unshare and process creation. > + */ > +static int ns_install(struct nsproxy *nsproxy, struct pid *pid, int flags) > +{ > + int ret = 0; > + struct task_struct *tsk; > + struct nsproxy *nsp; > + > + tsk = get_pid_task(pid, PIDTYPE_PID); > + if (!tsk) > + return -ESRCH; > + > + get_nsproxy(tsk->nsproxy); > + nsp = tsk->nsproxy; How is this correct? Are you holding any locks that protect tsk->nsproxy? > +#ifdef CONFIG_USER_NS > + if (wants_ns(flags, CLONE_NEWUSER)) { > + struct user_namespace *user_ns; > + > + user_ns = get_user_ns(__task_cred(tsk)->user_ns); > + ret = __ns_install(nsproxy, &user_ns->ns); If ret == 0, then at this point you've already committed the user namespace change *to the calling process*. The ->install handler of user namespaces doesn't touch the nsproxy at all. > + put_user_ns(user_ns); > + } > +#else > + if (flags & CLONE_NEWUSER) > + ret = -EINVAL; > +#endif > + > + if (!ret && wants_ns(flags, CLONE_NEWNS)) > + ret = __ns_install(nsproxy, mnt_ns_to_common(nsp->mnt_ns)); And this one might be even worse, because the mount namespace change itself is only stored in the nsproxy at this point, but the cwd and root paths have already been overwritten on the task's fs_struct. To actually make sys_set_ns() atomic, I think you'd need some moderately complicated prep work, splitting the ->install handlers up into prep work and a commit phase that can't fail. [...] > +#ifdef CONFIG_PID_NS > + if (!ret && wants_ns(flags, CLONE_NEWPID)) { > + struct pid_namespace *pidns; > + > + pidns = task_active_pid_ns(tsk); > + if (pidns) { > + get_pid_ns(pidns); > + ret = __ns_install(nsproxy, &pidns->ns); > + put_pid_ns(pidns); > + } If you can't get the task's pidns, shouldn't that be an error? > + } [...]