On Fri, Oct 31, 2014 at 6:09 PM, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote: > Aditya Kali <adityakali@xxxxxxxxxx> writes: > > > This patch enables cgroup mounting inside userns when a process > > as appropriate privileges. The cgroup filesystem mounted is > > rooted at the cgroupns-root. Thus, in a container-setup, only > > the hierarchy under the cgroupns-root is exposed inside the container. > > This allows container management tools to run inside the containers > > without depending on any global state. > > In order to support this, a new kernfs api is added to lookup the > > dentry for the cgroupns-root. > > There is a misdesign in this. Because files already exist we need the > protections that are present in proc and sysfs that only allow you to > mount the filesystem if it is already mounted. Otherwise you can wind > up mounting this cgroupfs in a chroot jail when the global root would > not like you to see it. cgroupfs isn't as bad as proc and sys but there > is at the very least an information leak in mounting it. > > I think simply mounting the cgroupfs doesn't give you any more information than what you don't already know about the system ; specially if the visibility is restricted within the process's cgroupns-root. The cgroups still wont be writable by the user, so I think it should be fine to allow mounting? > Given that we are effectively performing a bind mount in this patch, and > that we need to require cgroupfs be mounted anyway (to be safe). > > I don't see the point of this change. > > If we could change the set of cgroups or visible in cgroupfs I could > probably see the point. But as it is this change seems to be pointless. > > I agree that this is effectively bind-mounting, but doing this in kernel makes it really convenient for the userspace. The process that sets up the container doesn't need to care whether it should bind-mount cgroupfs inside the container or not. The tasks inside the container can mount cgroupfs on as-needed basis. The root container manager can simply unshare cgroupns and forget about the internal setup. I think this is useful just for the reason that it makes life much simpler for userspace. > Eric > > > > Signed-off-by: Aditya Kali <adityakali@xxxxxxxxxx> > > --- > > fs/kernfs/mount.c | 48 > ++++++++++++++++++++++++++++++++++++++++++++++++ > > include/linux/kernfs.h | 2 ++ > > kernel/cgroup.c | 47 > +++++++++++++++++++++++++++++++++++++++++++++-- > > 3 files changed, 95 insertions(+), 2 deletions(-) > > > > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c > > index f973ae9..e334f45 100644 > > --- a/fs/kernfs/mount.c > > +++ b/fs/kernfs/mount.c > > @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct > super_block *sb) > > return NULL; > > } > > > > +/** > > + * kernfs_make_root - create new root dentry for the given kernfs_node. > > + * @sb: the kernfs super_block > > + * @kn: kernfs_node for which a dentry is needed > > + * > > + * This can used used by callers which want to mount only a part of the > kernfs > > + * as root of the filesystem. > > + */ > > +struct dentry *kernfs_obtain_root(struct super_block *sb, > > + struct kernfs_node *kn) > > +{ > > + struct dentry *dentry; > > + struct inode *inode; > > + > > + BUG_ON(sb->s_op != &kernfs_sops); > > + > > + /* inode for the given kernfs_node should already exist. */ > > + inode = ilookup(sb, kn->ino); > > + if (!inode) { > > + pr_debug("kernfs: could not get inode for '"); > > + pr_cont_kernfs_path(kn); > > + pr_cont("'.\n"); > > + return ERR_PTR(-EINVAL); > > + } > > + > > + /* instantiate and link root dentry */ > > + dentry = d_obtain_root(inode); > > + if (!dentry) { > > + pr_debug("kernfs: could not get dentry for '"); > > + pr_cont_kernfs_path(kn); > > + pr_cont("'.\n"); > > + return ERR_PTR(-ENOMEM); > > + } > > + > > + /* If this is a new dentry, set it up. We need kernfs_mutex > because this > > + * may be called by callers other than kernfs_fill_super. */ > > + mutex_lock(&kernfs_mutex); > > + if (!dentry->d_fsdata) { > > + kernfs_get(kn); > > + dentry->d_fsdata = kn; > > + } else { > > + WARN_ON(dentry->d_fsdata != kn); > > + } > > + mutex_unlock(&kernfs_mutex); > > + > > + return dentry; > > +} > > + > > static int kernfs_fill_super(struct super_block *sb, unsigned long > magic) > > { > > struct kernfs_super_info *info = kernfs_info(sb); > > diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h > > index 3c2be75..b9538e0 100644 > > --- a/include/linux/kernfs.h > > +++ b/include/linux/kernfs.h > > @@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn); > > struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry); > > struct kernfs_root *kernfs_root_from_sb(struct super_block *sb); > > > > +struct dentry *kernfs_obtain_root(struct super_block *sb, > > + struct kernfs_node *kn); > > struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops, > > unsigned int flags, void *priv); > > void kernfs_destroy_root(struct kernfs_root *root); > > diff --git a/kernel/cgroup.c b/kernel/cgroup.c > > index 7e5d597..250aaec 100644 > > --- a/kernel/cgroup.c > > +++ b/kernel/cgroup.c > > @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, > struct cgroup_sb_opts *opts) > > > > memset(opts, 0, sizeof(*opts)); > > > > + /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init > cgroup > > + * namespace. > > + */ > > + if (current->nsproxy->cgroup_ns != &init_cgroup_ns) { > > + opts->flags |= CGRP_ROOT_SANE_BEHAVIOR; > > + } > > + > > while ((token = strsep(&o, ",")) != NULL) { > > nr_opts++; > > > > @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, > struct cgroup_sb_opts *opts) > > > > if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) { > > pr_warn("sane_behavior: this is still under development > and its behaviors will change, proceed at your own risk\n"); > > - if (nr_opts != 1) { > > + if (nr_opts > 1) { > > pr_err("sane_behavior: no other mount options > allowed\n"); > > return -EINVAL; > > } > > @@ -1581,6 +1588,15 @@ static void init_cgroup_root(struct cgroup_root > *root, > > set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags); > > } > > > > +struct dentry *cgroupns_get_root(struct super_block *sb, > > + struct cgroup_namespace *ns) > > +{ > > + struct dentry *nsdentry; > > + > > + nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn); > > + return nsdentry; > > +} > > + > > static int cgroup_setup_root(struct cgroup_root *root, unsigned int > ss_mask) > > { > > LIST_HEAD(tmp_links); > > @@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct > file_system_type *fs_type, > > int ret; > > int i; > > bool new_sb; > > + struct cgroup_namespace *ns = > > + get_cgroup_ns(current->nsproxy->cgroup_ns); > > + > > + /* Check if the caller has permission to mount. */ > > + if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) { > > + put_cgroup_ns(ns); > > + return ERR_PTR(-EPERM); > > + } > > > > /* > > * The first time anyone tries to mount a cgroup, enable the list > > @@ -1817,11 +1841,28 @@ out_free: > > kfree(opts.release_agent); > > kfree(opts.name); > > > > - if (ret) > > + if (ret) { > > + put_cgroup_ns(ns); > > return ERR_PTR(ret); > > + } > > > > dentry = kernfs_mount(fs_type, flags, root->kf_root, > > CGROUP_SUPER_MAGIC, &new_sb); > > + > > + if (!IS_ERR(dentry) && (root == &cgrp_dfl_root)) { > > + /* If this mount is for the default hierarchy in non-init > cgroup > > + * namespace, then instead of root cgroup's dentry, we > return > > + * the dentry corresponding to the cgroupns->root_cgrp. > > + */ > > + if (ns != &init_cgroup_ns) { > > + struct dentry *nsdentry; > > + > > + nsdentry = cgroupns_get_root(dentry->d_sb, ns); > > + dput(dentry); > > + dentry = nsdentry; > > + } > > + } > > + > > if (IS_ERR(dentry) || !new_sb) > > cgroup_put(&root->cgrp); > > > > @@ -1834,6 +1875,7 @@ out_free: > > deactivate_super(pinned_sb); > > } > > > > + put_cgroup_ns(ns); > > return dentry; > > } > > > > @@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = { > > .name = "cgroup", > > .mount = cgroup_mount, > > .kill_sb = cgroup_kill_sb, > > + .fs_flags = FS_USERNS_MOUNT, > > }; > > > > static struct kobject *cgroup_kobj; > -- Aditya _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/containers