On Thu, Apr 30, 2020 at 01:09:30PM -0500, Eric W. Biederman wrote: > Christian Brauner <christian.brauner@xxxxxxxxxx> writes: > > > Add a simple capability helper which makes it possible to determine > > whether a set of creds is ns capable wrt to the passed in credentials. > > This is not something exciting it's just a more pleasant wrapper around > > security_capable() by allowing ns_capable_common() to ake a const struct > > cred argument. In ptrace_has_cap() for example, we're using > > security_capable() directly. ns_capable_cred() will be used in the next > > patch to check against the target credentials the caller is going to > > switch to. > > Given that this is to suppot setns. I don't understand the > justification for this. > > Is it your intention to use the reduced permissions that you get > when you install a user namespace? Indeed. > > Why do you want to use the reduced permissions when installing multiple > namespaces at once? The intention is to use the target credentials we are going to install when setns() hits the point of no return. The target permissions are either the permissions of the caller or the reduced permissions if CLONE_NEWUSER has been requested. This has multiple reasons. The most obvious reason imho is that all other namespaces have an owning user namespace. Attaching to any other namespace requires the attacher to be privileged over the owning user namespace of that namespace. Consequently, all our current install handlers for every namespace we have check whether we are privileged in the owning user namespace of that user namespace. So in order to attach to any of those namespaces - especially when attaching as an unprivileged user - requires that we are attached to the user namespace first. (That's especially useful given that most users especially container runtimes will unshare all namespaces. Doing it this way we can not just have attach privileged users attach to their containers but also unprivileged users to their containers in one shot.) A few other points about this. If one looks at clone(CLONE_NEW*) or unshare(CLONE_NEW*) then the ordering and permissions checking is the same way. All permissions checks are performed against the reduced permissions, i.e. if CLONE_NEWUSER is specified you check privilege against the reduced permissions too otherwise you wouldn't be able to spawn into a complete set of new namespaces as an unprivileged user. This logic is also expressed in how setns() is already used in userspace. Any container runtime will attach to the user namespace first, so all subsequent calls to attach to other namespaces perform the checks against the reduced permissions. It also has to be that way because of fully unprivileged containers. To put it another way, if we were to always perform the permission checks against the current permissions (i.e. no matter if CLONE_NEWUSER is specified or not) then we'd never be able to attach to a set of namespaces at once as an unprivileged user. We also really want to be able to express both semantics: 1. setns(flags & ~CLONE_NEWUSER) --> attach to all namespaces with my current permission level 2. setns(flags | CLONE_NEWUSER) attach to all namespaces with the target permission level It feels weird if both 1 and 2 would mean the exact same thing given that the user namespace has an owernship relation with all the other namespaces. Christian