Re: [PATCH v2 3/4] nsproxy: attach to namespaces via pidfds

Christian Brauner <christian.brauner@xxxxxxxxxx> · Mon, 15 Jun 2020 17:45:30 +0200

On June 15, 2020 1:13:37 PM GMT+02:00, "Michael Kerrisk (man-pages)" <mtk.manpages@xxxxxxxxx> wrote:
>Hello Christian,
>
>Looking at this patch, and commit 303cc571d107b that landed in
>5.8-rc1, time namespaces were omitted. I assume this was an accident,
>since the commit message makes no statement about excluding time
>namespaces, and even mentions their existence as part of the
>justification for the patch ("With time namespaces we're looking at
>about 16 syscalls.").
>
>Is a fix needed here?
>
>Thanks,
>
>Michael
>
>On Thu, 30 Apr 2020 at 18:57, Christian Brauner
><christian.brauner@xxxxxxxxxx> wrote:
>>
>> For quite a while we have been thinking about using pidfds to attach
>to
>> namespaces. This patchset has existed for about a year already but
>we've
>> wanted to wait to see how the general api would be received and
>adopted.
>> Now that more and more programs in userspace have started using
>pidfds
>> for process management it's time to send this one out.
>>
>> This patch makes it possible to use pidfds to attach to the
>namespaces
>> of another process, i.e. they can be passed as the first argument to
>the
>> setns() syscall. When only a single namespace type is specified the
>> semantics are equivalent to passing an nsfd. That means
>> setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However,
>> when a pidfd is passed, multiple namespace flags can be specified in
>the
>> second setns() argument and setns() will attach the caller to all the
>> specified namespaces all at once or to none of them. Specifying 0 is
>not
>> valid together with a pidfd.
>>
>> The obvious example where this is useful is a standard container
>> manager interacting with a running container: pushing and pulling
>files
>> or directories, injecting mounts, attaching/execing any kind of
>process,
>> managing network devices all these operations require attaching to
>all
>> or at least multiple namespaces at the same time. Given that nowadays
>> most containers are spawned with all namespaces enabled we're
>currently
>> looking at at least 14 syscalls, 7 to open the /proc/<pid>/ns/<ns>
>> nsfds, another 7 to actually perform the namespace switch. With time
>> namespaces we're looking at about 16 syscalls.
>> (We could amortize the first 7 or 8 syscalls for opening the nsfds by
>>  stashing them in each container's monitor process but that would
>mean
>>  we need to send around those file descriptors through unix sockets
>>  everytime we want to interact with the container or keep on-disk
>>  state. Even in scenarios where a caller wants to join a particular
>>  namespace in a particular order callers still profit from batching
>>  other namespaces. That mostly applies to the user namespace but
>>  all container runtimes I found join the user namespace first no
>matter
>>  if it privileges or deprivileges the container similar to how
>unshare
>>  behaves.)
>> With pidfds this becomes a single syscall no matter how many
>namespaces
>> are supposed to be attached to.
>>
>> A decently designed, large-scale container manager usually isn't the
>> parent of any of the containers it spawns so the containers don't die
>> when it crashes or needs to update or reinitialize. This means that
>> for the manager to interact with containers through pids is
>inherently
>> racy especially on systems where the maximum pid number is not
>> significicantly bumped. This is even more problematic since we often
>spawn
>> and manage thousands or ten-thousands of containers. Interacting with
>a
>> container through a pid thus can become risky quite quickly.
>Especially
>> since we allow for an administrator to enable advanced features such
>as
>> syscall interception where we're performing syscalls in lieu of the
>> container. In all of those cases we use pidfds if they are available
>and
>> we pass them around as stable references. Using them to setns() to
>the
>> target process' namespaces is as reliable as using nsfds. Either the
>> target process is already dead and we get ESRCH or we manage to
>attach
>> to its namespaces but we can't accidently attach to another process'
>> namespaces. So pidfds lend themselves to be used with this api.
>> The other main advantage is that with this change the pidfd becomes
>the
>> only relevant token for most container interactions and it's the only
>> token we need to create and send around.
>>
>> Apart from significiantly reducing the number of syscalls from double
>> digit to single digit which is a decent reason post-spectre/meltdown
>> this also allows to switch to a set of namespaces atomically, i.e.
>> either attaching to all the specified namespaces succeeds or we fail.
>If
>> we fail we haven't changed a single namespace. There are currently
>three
>> namespaces that can fail (other than for ENOMEM which really is not
>> very interesting since we then have other problems anyway) for
>> non-trivial reasons, user, mount, and pid namespaces. We can fail to
>> attach to a pid namespace if it is not our current active pid
>namespace
>> or a descendant of it. We can fail to attach to a user namespace
>because
>> we are multi-threaded or because our current mount namespace shares
>> filesystem state with other tasks, or because we're trying to setns()
>> to the same user namespace, i.e. the target task has the same user
>> namespace as we do. We can fail to attach to a mount namespace
>because
>> it shares filesystem state with other tasks or because we fail to
>lookup
>> the new root for the new mount namespace. In most non-pathological
>> scenarios these issues can be somewhat mitigated. But there are cases
>where
>> we're half-attached to some namespace and failing to attach to
>another one.
>> I've talked about some of these problem during the hallway track
>(something
>> only the pre-COVID-19 generation will remember) of Plumbers in Los
>Angeles
>> in 2018(?). Even if all these issues could be avoided with super
>careful
>> userspace coding it would be nicer to have this done in-kernel.
>Pidfds seem
>> to lend themselves nicely for this.
>>
>> The other neat thing about this is that setns() becomes an actual
>> counterpart to the namespace bits of unshare().
>>
>> Cc: Eric W. Biederman <ebiederm@xxxxxxxxxxxx>
>> Cc: Serge Hallyn <serge@xxxxxxxxxx>
>> Cc: Jann Horn <jannh@xxxxxxxxxx>
>> Cc: Michael Kerrisk <mtk.manpages@xxxxxxxxx>
>> Cc: Aleksa Sarai <cyphar@xxxxxxxxxx>
>> Signed-off-by: Christian Brauner <christian.brauner@xxxxxxxxxx>
>> ---
>> Seems that some people do in fact find this useful and I still have
>> hopes that I can potentially send this for v5.8. I've added some
>> expanded testing to this whole series to catch any mistakes.
>> The lifecycle management for various objects was certainly the
>> most interesting aspect so I'd appreciate a close look.
>> /* v2 */
>> - Michael Kerrisk <mtk.manpages@xxxxxxxxx>:
>>   - Michael pointed out that the semantics for setns(nsfd, 0) and
>>     setns(pidfd, 0) are not comparable. setns(pidfd, 0) is now
>>     disallowed completely. Users wanting to attach to all namespaces
>>     should simply specify them explicitly just as with unshare() and
>>     clone3().
>> - Jann Horn <jannh@xxxxxxxxxx>:
>>   - Jann pointed out that the setns() in its first form wasn't atomic
>in
>>     that userspace could end up in an intermediate state where e.g.
>the
>>     process had moved into the target user namespace but failed to
>move
>>     into the target mount namespace.
>>     In this new version I've removed all intermediate states. There's
>an
>>     installation/preparation state and a commit state similar to
>>     prepare_creds() and commit_creds().
>> ---
>>  fs/nsfs.c               |   7 +-
>>  include/linux/proc_fs.h |   6 ++
>>  kernel/nsproxy.c        | 228
>++++++++++++++++++++++++++++++++++++----
>>  3 files changed, 222 insertions(+), 19 deletions(-)
>>
>> diff --git a/fs/nsfs.c b/fs/nsfs.c
>> index 4f1205725cfe..b023c1a367c8 100644
>> --- a/fs/nsfs.c
>> +++ b/fs/nsfs.c
>> @@ -229,6 +229,11 @@ int ns_get_name(char *buf, size_t size, struct
>task_struct *task,
>>         return res;
>>  }
>>
>> +bool proc_ns_file(const struct file *file)
>> +{
>> +       return file->f_op == &ns_file_operations;
>> +}
>> +
>>  struct file *proc_ns_fget(int fd)
>>  {
>>         struct file *file;
>> @@ -237,7 +242,7 @@ struct file *proc_ns_fget(int fd)
>>         if (!file)
>>                 return ERR_PTR(-EBADF);
>>
>> -       if (file->f_op != &ns_file_operations)
>> +       if (!proc_ns_file(file))
>>                 goto out_invalid;
>>
>>         return file;
>> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
>> index 45c05fd9c99d..acfd5012db4e 100644
>> --- a/include/linux/proc_fs.h
>> +++ b/include/linux/proc_fs.h
>> @@ -104,6 +104,7 @@ struct proc_dir_entry
>*proc_create_net_single_write(const char *name, umode_t mo
>>                                                     proc_write_t
>write,
>>                                                     void *data);
>>  extern struct pid *tgid_pidfd_to_pid(const struct file *file);
>> +extern bool proc_ns_file(const struct file *file);
>>
>>  #ifdef CONFIG_PROC_PID_ARCH_STATUS
>>  /*
>> @@ -159,6 +160,11 @@ static inline struct pid
>*tgid_pidfd_to_pid(const struct file *file)
>>         return ERR_PTR(-EBADF);
>>  }
>>
>> +static inline bool proc_ns_file(const struct file *file)
>> +{
>> +       return false;
>> +}
>> +
>>  #endif /* CONFIG_PROC_FS */
>>
>>  struct net;
>> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
>> index 2da463bab58a..8cae508acb9c 100644
>> --- a/kernel/nsproxy.c
>> +++ b/kernel/nsproxy.c
>> @@ -20,6 +20,7 @@
>>  #include <linux/ipc_namespace.h>
>>  #include <linux/time_namespace.h>
>>  #include <linux/fs_struct.h>
>> +#include <linux/proc_fs.h>
>>  #include <linux/proc_ns.h>
>>  #include <linux/file.h>
>>  #include <linux/syscalls.h>
>> @@ -258,12 +259,53 @@ void exit_task_namespaces(struct task_struct
>*p)
>>         switch_task_namespaces(p, NULL);
>>  }
>>
>> +static int check_setns_flags(unsigned long flags)
>> +{
>> +       if (!flags || (flags & ~(CLONE_NEWNS | CLONE_NEWUTS |
>CLONE_NEWIPC |
>> +                                CLONE_NEWNET | CLONE_NEWUSER |
>CLONE_NEWPID |
>> +                                CLONE_NEWCGROUP)))
>> +               return -EINVAL;
>> +
>> +#ifndef CONFIG_USER_NS
>> +       if (flags & CLONE_NEWUSER)
>> +               return -EINVAL;
>> +#endif
>> +#ifndef CONFIG_PID_NS
>> +       if (flags & CLONE_NEWPID)
>> +               return -EINVAL;
>> +#endif
>> +#ifndef CONFIG_UTS_NS
>> +       if (flags & CLONE_NEWUTS)
>> +               return -EINVAL;
>> +#endif
>> +#ifndef CONFIG_IPC_NS
>> +       if (flags & CLONE_NEWIPC)
>> +               return -EINVAL;
>> +#endif
>> +#ifndef CONFIG_CGROUPS
>> +       if (flags & CLONE_NEWCGROUP)
>> +               return -EINVAL;
>> +#endif
>> +#ifndef CONFIG_NET_NS
>> +       if (flags & CLONE_NEWNET)
>> +               return -EINVAL;
>> +#endif
>> +
>> +       return 0;
>> +}
>> +
>>  static void put_nsset(struct nsset *nsset)
>>  {
>>         unsigned flags = nsset->flags;
>>
>>         if (flags & CLONE_NEWUSER)
>>                 put_cred(nsset_cred(nsset));
>> +       /*
>> +        * We only created a temporary copy if we attached to more
>than just
>> +        * the mount namespace.
>> +        */
>> +       if (nsset->fs && (flags & CLONE_NEWNS) && (flags &
>~CLONE_NEWNS))
>> +               free_fs_struct(nsset->fs);
>>         if (nsset->nsproxy)
>>                 free_nsproxy(nsset->nsproxy);
>>  }
>> @@ -283,8 +325,14 @@ static int prepare_nsset(unsigned flags, struct
>nsset *nsset)
>>         if (!nsset->cred)
>>                 goto out;
>>
>> -       if (flags & CLONE_NEWNS)
>> +       /* Only create a temporary copy of fs_struct if we really
>need to. */
>> +       if (flags == CLONE_NEWNS) {
>>                 nsset->fs = me->fs;
>> +       } else if (flags & CLONE_NEWNS) {
>> +               nsset->fs = copy_fs_struct(me->fs);
>> +               if (!nsset->fs)
>> +                       goto out;
>> +       }
>>
>>         nsset->flags = flags;
>>         return 0;
>> @@ -294,6 +342,138 @@ static int prepare_nsset(unsigned flags, struct
>nsset *nsset)
>>         return -ENOMEM;
>>  }
>>
>> +static inline int validate_ns(struct nsset *nsset, struct ns_common
>*ns)
>> +{
>> +       return ns->ops->install(nsset, ns);
>> +}
>> +
>> +/*
>> + * This is the inverse operation to unshare().
>> + * Ordering is equivalent to the standard ordering used everywhere
>else
>> + * during unshare and process creation. The switch to the new set of
>> + * namespaces occurs at the point of no return after installation of
>> + * all requested namespaces was successful in commit_nsset().
>> + */
>> +static int validate_nsset(struct nsset *nsset, struct pid *pid)
>> +{
>> +       int ret = 0;
>> +       unsigned flags = nsset->flags;
>> +       struct user_namespace *user_ns = NULL;
>> +       struct pid_namespace *pid_ns = NULL;
>> +       struct nsproxy *nsp;
>> +       struct task_struct *tsk;
>> +
>> +       /* Take a "snapshot" of the target task's namespaces. */
>> +       rcu_read_lock();
>> +       tsk = pid_task(pid, PIDTYPE_PID);
>> +       if (!tsk) {
>> +               rcu_read_unlock();
>> +               return -ESRCH;
>> +       }
>> +
>> +       if (!ptrace_may_access(tsk, PTRACE_MODE_READ_REALCREDS)) {
>> +               rcu_read_unlock();
>> +               return -EPERM;
>> +       }
>> +
>> +       task_lock(tsk);
>> +       nsp = tsk->nsproxy;
>> +       if (nsp)
>> +               get_nsproxy(nsp);
>> +       task_unlock(tsk);
>> +       if (!nsp) {
>> +               rcu_read_unlock();
>> +               return -ESRCH;
>> +       }
>> +
>> +#ifdef CONFIG_PID_NS
>> +       if (flags & CLONE_NEWPID) {
>> +               pid_ns = task_active_pid_ns(tsk);
>> +               if (unlikely(!pid_ns)) {
>> +                       rcu_read_unlock();
>> +                       ret = -ESRCH;
>> +                       goto out;
>> +               }
>> +               get_pid_ns(pid_ns);
>> +       }
>> +#endif
>> +
>> +#ifdef CONFIG_USER_NS
>> +       if (flags & CLONE_NEWUSER)
>> +               user_ns = get_user_ns(__task_cred(tsk)->user_ns);
>> +#endif
>> +       rcu_read_unlock();
>> +
>> +       /*
>> +        * Install requested namespaces. The caller will have
>> +        * verified earlier that the requested namespaces are
>> +        * supported on this kernel. We don't report errors here
>> +        * if a namespace is requested that isn't supported.
>> +        */
>> +#ifdef CONFIG_USER_NS
>> +       if (flags & CLONE_NEWUSER) {
>> +               ret = validate_ns(nsset, &user_ns->ns);
>> +               if (ret)
>> +                       goto out;
>> +       }
>> +#endif
>> +
>> +       if (flags & CLONE_NEWNS) {
>> +               ret = validate_ns(nsset,
>mnt_ns_to_common(nsp->mnt_ns));
>> +               if (ret)
>> +                       goto out;
>> +       }
>> +
>> +#ifdef CONFIG_UTS_NS
>> +       if (flags & CLONE_NEWUTS) {
>> +               ret = validate_ns(nsset, &nsp->uts_ns->ns);
>> +               if (ret)
>> +                       goto out;
>> +       }
>> +#endif
>> +
>> +#ifdef CONFIG_IPC_NS
>> +       if (flags & CLONE_NEWIPC) {
>> +               ret = validate_ns(nsset, &nsp->ipc_ns->ns);
>> +               if (ret)
>> +                       goto out;
>> +       }
>> +#endif
>> +
>> +#ifdef CONFIG_PID_NS
>> +       if (flags & CLONE_NEWPID) {
>> +               ret = validate_ns(nsset, &pid_ns->ns);
>> +               if (ret)
>> +                       goto out;
>> +       }
>> +#endif
>> +
>> +#ifdef CONFIG_CGROUPS
>> +       if (flags & CLONE_NEWCGROUP) {
>> +               ret = validate_ns(nsset, &nsp->cgroup_ns->ns);
>> +               if (ret)
>> +                       goto out;
>> +       }
>> +#endif
>> +
>> +#ifdef CONFIG_NET_NS
>> +       if (flags & CLONE_NEWNET) {
>> +               ret = validate_ns(nsset, &nsp->net_ns->ns);
>> +               if (ret)
>> +                       goto out;
>> +       }
>> +#endif
>> +
>> +out:
>> +       if (pid_ns)
>> +               put_pid_ns(pid_ns);
>> +       if (nsp)
>> +               put_nsproxy(nsp);
>> +       put_user_ns(user_ns);
>> +
>> +       return ret;
>> +}
>> +
>>  /*
>>   * This is the point of no return. There are just a few namespaces
>>   * that do some actual work here and it's sufficiently minimal that
>> @@ -316,6 +496,12 @@ static void commit_nsset(struct nsset *nsset)
>>         }
>>  #endif
>>
>> +       /* We only need to commit if we have used a temporary
>fs_struct. */
>> +       if ((flags & CLONE_NEWNS) && (flags & ~CLONE_NEWNS)) {
>> +               set_fs_root(me->fs, &nsset->fs->root);
>> +               set_fs_pwd(me->fs, &nsset->fs->pwd);
>> +       }
>> +
>>  #ifdef CONFIG_IPC_NS
>>         if (flags & CLONE_NEWIPC)
>>                 exit_sem(me);
>> @@ -326,33 +512,39 @@ static void commit_nsset(struct nsset *nsset)
>>         nsset->nsproxy = NULL;
>>  }
>>
>> -static inline int validate_ns(struct nsset *nsset, struct ns_common
>*ns)
>> -{
>> -       return ns->ops->install(nsset, ns);
>> -}
>> -
>> -SYSCALL_DEFINE2(setns, int, fd, int, nstype)
>> +SYSCALL_DEFINE2(setns, int, fd, int, flags)
>>  {
>>         struct task_struct *tsk = current;
>>         struct file *file;
>> -       struct ns_common *ns;
>> +       struct ns_common *ns = NULL;
>>         struct nsset nsset = {};
>> -       int err;
>> -
>> -       file = proc_ns_fget(fd);
>> -       if (IS_ERR(file))
>> -               return PTR_ERR(file);
>> +       int err = 0;
>>
>> -       err = -EINVAL;
>> -       ns = get_proc_ns(file_inode(file));
>> -       if (nstype && (ns->ops->type != nstype))
>> +       file = fget(fd);
>> +       if (!file)
>> +               return -EBADF;
>> +
>> +       if (proc_ns_file(file)) {
>> +               ns = get_proc_ns(file_inode(file));
>> +               if (flags && (ns->ops->type != flags))
>> +                       err = -EINVAL;
>> +               flags = ns->ops->type;
>> +       } else if (pidfd_pid(file)) {
>> +               err = check_setns_flags(flags);
>> +       } else {
>> +               err = -EBADF;
>> +       }
>> +       if (err)
>>                 goto out;
>>
>> -       err = prepare_nsset(nstype, &nsset);
>> +       err = prepare_nsset(flags, &nsset);
>>         if (err)
>>                 goto out;
>>
>> -       err = validate_ns(&nsset, ns);
>> +       if (proc_ns_file(file))
>> +               err = validate_ns(&nsset, ns);
>> +       else
>> +               err = validate_nsset(&nsset, file->private_data);
>>         if (!err) {
>>                 commit_nsset(&nsset);
>>                 perf_event_namespaces(tsk);
>> --
>> 2.26.2
>>

I've omitted them on purpose since we need a small fix first.
I've already discussed this at
https://lore.kernel.org/lkml/20200612144744.orw4yiaosghqlv4w@wittgenstein
I'll send the series next week. I'm on vacation this week.

Thanks!
Christian