On Sat, May 02, 2020 at 09:52:03AM -0500, Eric W. Biederman wrote: > Christian Brauner <christian.brauner@xxxxxxxxxx> writes: > > > On Sat, May 02, 2020 at 07:35:53AM -0500, Eric W. Biederman wrote: > >> Christian Brauner <christian.brauner@xxxxxxxxxx> writes: > >> > >> > On Thu, Apr 30, 2020 at 01:09:30PM -0500, Eric W. Biederman wrote: > >> >> Christian Brauner <christian.brauner@xxxxxxxxxx> writes: > >> >> > >> >> > Add a simple capability helper which makes it possible to determine > >> >> > whether a set of creds is ns capable wrt to the passed in credentials. > >> >> > This is not something exciting it's just a more pleasant wrapper around > >> >> > security_capable() by allowing ns_capable_common() to ake a const struct > >> >> > cred argument. In ptrace_has_cap() for example, we're using > >> >> > security_capable() directly. ns_capable_cred() will be used in the next > >> >> > patch to check against the target credentials the caller is going to > >> >> > switch to. > >> >> > >> >> Given that this is to suppot setns. I don't understand the > >> >> justification for this. > >> >> > >> >> Is it your intention to use the reduced permissions that you get > >> >> when you install a user namespace? > >> > > >> > Indeed. > >> > > >> >> > >> >> Why do you want to use the reduced permissions when installing multiple > >> >> namespaces at once? > >> > > >> > The intention is to use the target credentials we are going to install > >> > when setns() hits the point of no return. The target permissions are > >> > either the permissions of the caller or the reduced permissions if > >> > CLONE_NEWUSER has been requested. This has multiple reasons. > >> > > >> > The most obvious reason imho is that all other namespaces have an owning > >> > user namespace. Attaching to any other namespace requires the attacher > >> > to be privileged over the owning user namespace of that namespace. > >> > Consequently, all our current install handlers for every namespace we > >> > have check whether we are privileged in the owning user namespace of > >> > that user namespace. So in order to attach to any of those namespaces - > >> > especially when attaching as an unprivileged user - requires that we are > >> > attached to the user namespace first. > >> > >> No actually it doesn't. As if you have privileges to attach to the user > >> namespace you have the privileges to attach to anything it owns. Or you > >> should I think I am missing some subtle detail at the moment. > > > > Sorry, I phrased that very misleadingly. What I'm talking about should > > be obvious in the second patch: > > > > -static int utsns_install(struct nsproxy *nsproxy, struct ns_common *new) > > +static int utsns_install(struct nsset *nsset, struct ns_common *new) > > { > > + struct nsproxy *nsproxy = nsset->nsproxy; > > struct uts_namespace *ns = to_uts_ns(new); > > > > - if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) || > > - !ns_capable(current_user_ns(), CAP_SYS_ADMIN)) > > + if (!ns_capable_cred(nsset->cred, ns->user_ns, CAP_SYS_ADMIN) || > > + !ns_capable_cred(nsset->cred, nsset->cred->user_ns, CAP_SYS_ADMIN)) > > return -EPERM; > > > > All our current install handlers check that we're privileged wrt to our > > _current_ user namespace, i.e. they all have a > > ns_capable(current_user_ns(), CAP_SYS_ADMIN) > > line. So if we specify > > setns(CLONE_NEWUSER | CLONE_NEW*) > > as an unprivileged user we aren't able to attach to any other namespace > > unless we check against the target credentials. > > > >> > >> > >> > (That's especially useful given > >> > that most users especially container runtimes will unshare all > >> > namespaces. Doing it this way we can not just have attach privileged > >> > users attach to their containers but also unprivileged users to their > >> > containers in one shot.) > >> > >> That is a wonderful reason for doing things, and it is the reason > >> why I am asking about it because I think you have it backwards. > >> > >> Especially in the context of some container runtimes like Kubernetes > >> that I have been told will do things like share a network namespace > >> across all containers in a POD. > > > > Kubernetes doesn't use user namespace at all so they need to run as root > > anyway so that's not a problem. And if we're talking about scenarios > > where some namespaces are shared and other's aren't then there'll be > > batch-attaching any, i.e. not all at once but some at once. So I don't > > think this is a good argument. > > > >> > >> > A few other points about this. If one looks at clone(CLONE_NEW*) or > >> > unshare(CLONE_NEW*) then the ordering and permissions checking is the > >> > same way. All permissions checks are performed against the reduced > >> > permissions, i.e. if CLONE_NEWUSER is specified you check privilege > >> > against the reduced permissions too otherwise you wouldn't be able to > >> > spawn into a complete set of new namespaces as an unprivileged user. > >> > >> That is a good catch and definitely a reason for looking at doing > >> things in this order. > >> > >> For unshare and clone putting things in a user namespace means you can > >> create namespaces you could not create otherwise. > >> > >> > >> > This logic is also expressed in how setns() is already used in > >> > userspace. Any container runtime will attach to the user namespace first, > >> > so all subsequent calls to attach to other namespaces perform the checks > >> > against the reduced permissions. It also has to be that way because of > >> > fully unprivileged containers. > >> > >> So I sat and looked. For nsetner it winds up trying to enter > >> the namespaces in either order. > >> > >> /* > >> * Now that we know which namespaces we want to enter, enter > >> * them. Do this in two passes, not entering the user > >> * namespace on the first pass. So if we're deprivileging the > >> * container we'll enter the user namespace last and if we're > >> * privileging it then we enter the user namespace first > >> * (because the initial setns will fail). > >> */ > >> for (pass = 0; pass < 2; pass ++) { > >> for (nsfile = namespace_files + 1 - pass; nsfile->nstype; nsfile++) { > >> if (nsfile->fd < 0) > >> continue; > >> if (nsfile->nstype == CLONE_NEWPID && do_fork == -1) > >> do_fork = 1; > >> if (setns(nsfile->fd, nsfile->nstype)) { > >> if (pass != 0) > >> err(EXIT_FAILURE, > >> _("reassociate to namespace '%s' failed"), > >> nsfile->name); > >> else > >> continue; > >> } > >> > >> close(nsfile->fd); > >> nsfile->fd = -1; > >> } > >> } > >> > >> > >> Looking a little close we have at least for entering the mntns the > >> following checks: > >> > >> if (!ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN) || > >> !ns_capable(current_user_ns(), CAP_SYS_CHROOT) || > >> !ns_capable(current_user_ns(), CAP_SYS_ADMIN)) > >> return -EPERM; > >> > >> Which require CAP_SYS_CHROOT and CAP_SYS_ADMIN in the current user namespace. > >> > >> So there is defintiely an issue there. > >> > >> I suspect the clean approach is to simply require CAP_SYS_CHROOT in > >> the new user namespace, when we are setting several of them at once. > > > > See my example above. All install handlers check for > > ns_capable(current_user_ns(), CAP_SYS_ADMIN). > > > > > > static int ipcns_install(struct nsproxy *nsproxy, struct ns_common *new) > > { > > struct ipc_namespace *ns = to_ipc_ns(new); > > if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) || > > !ns_capable(current_user_ns(), CAP_SYS_ADMIN)) > > return -EPERM; > > > > > > static int netns_install(struct nsproxy *nsproxy, struct ns_common *ns) > > { > > struct net *net = to_net_ns(ns); > > > > if (!ns_capable(net->user_ns, CAP_SYS_ADMIN) || > > !ns_capable(current_user_ns(), CAP_SYS_ADMIN)) > > return -EPERM; > > > > static int mntns_install(struct nsproxy *nsproxy, struct ns_common *ns) > > { > > struct fs_struct *fs = current->fs; > > struct mnt_namespace *mnt_ns = to_mnt_ns(ns), *old_mnt_ns; > > struct path root; > > int err; > > > > if (!ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN) || > > !ns_capable(current_user_ns(), CAP_SYS_CHROOT) || > > !ns_capable(current_user_ns(), CAP_SYS_ADMIN)) > > return -EPERM; > > > > static int cgroupns_install(struct nsproxy *nsproxy, struct ns_common *ns) > > { > > struct cgroup_namespace *cgroup_ns = to_cg_ns(ns); > > > > if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) || > > !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN)) > > return -EPERM; > > > > static int utsns_install(struct nsproxy *nsproxy, struct ns_common *new) > > { > > struct uts_namespace *ns = to_uts_ns(new); > > > > if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) || > > !ns_capable(current_user_ns(), CAP_SYS_ADMIN)) > > return -EPERM; > > > > static int pidns_install(struct nsproxy *nsproxy, struct ns_common *ns) > > { > > struct pid_namespace *active = task_active_pid_ns(current); > > struct pid_namespace *ancestor, *new = to_pid_ns(ns); > > > > if (!ns_capable(new->user_ns, CAP_SYS_ADMIN) || > > !ns_capable(current_user_ns(), CAP_SYS_ADMIN)) > > return -EPERM; > > > >> > >> > To put it another way, if we were to always perform the permission checks > >> > against the current permissions (i.e. no matter if CLONE_NEWUSER is > >> > specified or not) then we'd never be able to attach to a set of > >> > namespaces at once as an unprivileged user. > >> > We also really want to be able to express both semantics: > >> > 1. setns(flags & ~CLONE_NEWUSER) --> attach to all namespaces with my > >> > current permission level > >> > 2. setns(flags | CLONE_NEWUSER) attach to all namespaces with the target > >> > permission level > >> > It feels weird if both 1 and 2 would mean the exact same thing given > >> > that the user namespace has an owernship relation with all the other > >> > namespaces. > >> > >> It feels weird to me to disallow anything that we have permissions for. > >> > >> Can you dig up the actual install permissions checks and see if there is > >> anything other than the mount namespace that needs permissions in the > >> current user namespace? > > > > Yep did, all of them check ns_capable(current_user_ns(), CAP_SYS_ADMIN), > > see above please. > > > >> > >> Please let's walk this through. I think there should be a way to > >> carefully phrase the permission checks that we don't need > >> ns_capable_cred that will allow goofy cases like setns into Kuberneties > >> PODs that share network namespaces. > > > > Hm, Kubernetes doesn't use user namespace. I think I misunderstand your > > concern maybe. But see below for a suggestion. > > > >> > >> I believe that will be a way to phrase the permission checks so that > >> with or without CLONE_NEWUSER they make sense, and give very similar > >> results. > >> > >> Certainly attaching to a mount namespace is going to need either being > >> root or attaching to a user namespace at the same time. Because > >> attaching to a mount namespace needs functionality that the user > >> namespace provides. > > > > So how about we add a new flag to setns() > > SETNS_NEWUSER_CRED that is only valid in > > conjunction with CLONE_NEWUSER and which allows callers to tell the > > kernel "assume the target credentials first". This way we can support > > both cases and users can specify what they want and we can clearly > > document it on the manpages. > > Let's not get complicated. Let's make this very simple. I'd call it "flexible". :) > > Change "ns_capable(current_user_user_ns(), CAP_SYS_ADMIN)" > to "ns_capable(nsset->cred->user_ns, CAP_SYS_ADMIN)". > > We still use the current credentials, but we require the caller of setns > to have CAP_SYS_ADMIN in the user namespace you are going to wind up in. > That is exactly what userns_install does today. > > That won't be any kind of security hole because we still require > "ns_capable(ns->user_ns, CAP_SYS_ADMIN)" > on all of the namespaces as well. > > That way if you are sufficiently privileged you can still handle joining > namespaces that are not owned by the owner of the processes user > namespace. So we can join weird processes. > > Does that make sense? Well, it's funny that you say that since I had a version of that patchset and I still have it at setns_pidfd_v2_wip_v5. I justed pushed it to https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=setns_pidfd_v2_wip_v5 The reason I refrained from doing this was fear of introducing a security hole but since we agree that this would be fine let's give it a go. In any case, I think I never ran the test-suite I added on that version. Let me plug this in and test this. Christian