Quoting Eric W. Biederman (ebiederm@xxxxxxxxxxxx): > Tejun Heo <tj@xxxxxxxxxx> writes: > > > Hello, Serge. > > > > On Tue, Nov 06, 2012 at 09:01:32AM -0600, Serge Hallyn wrote: > >> More practically, lacking user namespaces you can create a full (i.e. > >> ubuntu) container that doesn't have CAP_SYS_ADMIN, but not one without > >> root. So this allows you to prevent containers from bypassing devices > >> cgroup restrictions set by the parent. (In reality we are not using > >> that in ubuntu - we grant CAP_SYS_ADMIN and use apparmor to restrict - > >> but other distros do.) > > > > Do you even mount cgroupfs in containers? If you just bind-mount > > cgroupfs verbatim in containers, I don't think that's gonna work very > > well. If not, all this doesn't make any difference for containers. > > > > So, you don't really have any actual use case for the explicit CAP_* > > checks, right? > > Having thought about this a little more I can give a definitive answer. > > Adding a process to the device control group is equivalent to calling > mknod, as it allows that process to open device nodes, or equivalently > not open device nodes. Therefore a capable check is absolutely > required. > > Without a capability check it would be possible to remove access to > /dev/console for a suid root application keeping it from reporting > attempts to hack it for example. > > update_access can allow access to previously unaccessible devices > and so is equivalent to mknod and as such requires a capability call. > > static int devcgroup_update_access(struct dev_cgroup *devcgroup, > int filetype, const char *buffer) > { > .... > if (!capable(CAP_SYS_ADMIN)) > return -EPERM; > > > Likewise move to a different cgroup can give you a completely different > set of devices you can use. And is roughly equivalent to mknod, and > needs a capability call. > > static int devcgroup_can_attach(struct cgroup *new_cgrp, > struct cgroup_taskset *set) > { > struct task_struct *task = cgroup_taskset_first(set); > > if (current != task && !capable(CAP_SYS_ADMIN)) > return -EPERM; > > > The generic cgroup check in attach_task_by_pid to see if you can move > another process into a cgroup needs to be a capability call and not a > test for uid == 0. > > static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup) > { > if (pid) { > tsk = find_task_by_vpid(pid); > > /* > * even if we're attaching all tasks in the thread group, we > * only need to check permissions on one of them. > */ > tcred = __task_cred(tsk); > if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) && > ^^^^^^^^^^^^^^^ > !uid_eq(cred->euid, tcred->uid) && > !uid_eq(cred->euid, tcred->suid)) { > rcu_read_unlock(); > ret = -EACCES; > goto out_unlock_cgroup; > > Eric (full context kept, though long, bc it's all important) Note that part of the problem is simply that the devices cgroup is serving as a stand-in for the lack of both user and device namespaces. If those both existed, we could get rid of the devices cgroup. Likewise, the presence of the devices cgroup makes a device namespace far less compelling :) We can play games with bind mounts into /dev and devcgroup to do most of what we want a devicens for. -serge _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/containers