Re: [PATCH] mountinfo: implement show_path for kernfs and cgroup

"Serge E. Hallyn" <serge@xxxxxxxxxx> · Fri, 6 May 2016 12:33:56 -0500

Quoting Michael Kerrisk (man-pages) (mtk.manpages@xxxxxxxxx):
> Hi Serge,
> 
> I'll add my own notes below, as much as anything in order to convince 
> myself that I understand what's going on.
> 
> On 05/05/2016 05:20 PM, Serge E. Hallyn wrote:
> > Short explanation:
> > 
> > When showing a cgroupfs entry in mountinfo, show the path of the mount
> > root dentry relative to the reader's cgroup namespace root.
> 
> As part of the commit message, I think it would be useful to add a 
> sentence here explain why this is needed / which applications need it.
> 
> > Long version:
> > 
> > When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
> > namespace, and then mounts a new instance of the freezer cgroup, the new
> > mount will be rooted at /a/b.  The root dentry field of the mountinfo
> > entry will show '/a/b'.
> 
> So, the point is that if we create a new cgroup namespace,
> then we want both /proc/self/cgroup and /proc/self/mountinfo
> to show cgroup paths that are correctly virtualized with
> respect to the cgroup mount point. Previous to this patch,
> /proc/self/cgroup shows the right info, but
> /proc/self/mountinfo does not. (Walk through in a moment.) 
> 
> Is the above a correct summary?

Correct.

> >  cat > /tmp/do1 << EOF
> >  mount -t cgroup -o freezer freezer /mnt
> >  grep freezer /proc/self/mountinfo
> >  EOF
> > 
> >  unshare -Gm  bash /tmp/do1
> 
> Should that not be:
> 
>     unshare -Gm  bash /tmp/do1
> 
> (i.e., s/G/C/)?

Indeed

> >  > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
> >  > 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer
> > 
> > The task's freezer cgroup entry in /proc/self/cgroup will simply show
> > '/':
> > 
> >  grep freezer /proc/self/cgroup
> >  9:freezer:/
> > 
> > If instead the same task simply bind mounts the /a/b cgroup directory,
> > the resulting mountinfo entry will again show /a/b for the dentry root.
> > However in this case the task will find its own cgroup at /mnt/a/b,
> > not at /mnt:
> > 
> >  mount --bind /sys/fs/cgroup/freezer/a/b /mnt
> >  130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer
> > 
> > In other words, there is no way for the task to know, based on what is
> > in mountinfo, which cgroup directory is its own.
> 
> So, my walk through and commentary to the above...
> 
> First, a little script to save some typing and verbiage:
> 
> # cat cgroup_info.sh
> #!/bin/sh
> echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
> cat /proc/self/mountinfo | grep freezer |
>         awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'
> #
> 
> Create cgroup, place this shell into the cgroup, and look at the state of
> the /proc files:
> 
> # mkdir -p /sys/fs/cgroup/freezer/a/b
> # echo $$ > /sys/fs/cgroup/freezer/a/b/cgroup.procs
> # echo $$
> 2653
> # cat /sys/fs/cgroup/freezer/a/b/cgroup.procs 
> 2653                         # Our shell
> 14254                        # cat(1)
> # ./cgroup_info.sh 
> 	/proc/self/cgroup:	10:freezer:/a/b
> 	mountinfo:		/	/sys/fs/cgroup/freezer
> 
> Create a shell in new cgroup and mount namespaces. The act of creating
> a new cgroup namespace causes the process's current cgroups directories 
> to become its cgroup root directories. (Here, I'm using my own version
> of the "unshare" utility, which takes the same options as the util-linux
> version):
> 
> # ~mtk/tlpi/code/ns/unshare -Cm bash
> 
> Look at the state of the /proc files:
> 
> # ./cgroup_info.sh 
> 	/proc/self/cgroup:	10:freezer:/
> 	mountinfo:		/	/sys/fs/cgroup/freezer
> 
> The third entry in /proc/self/info (the pathname of the cgroup inside the

/proc/self/cgroup

> hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which is
> rooted at /a/b in the outer namespace.
> 
> However, the info in /proc/self/mountinfo is not for this cgroup
> namespace, since we are seeing a duplicate of the mount from the 
> old mount namespace, and the info there does not correspond to the
> new cgroup namespace. However, trying to create a new mount still 
> doesn't show us the right information in mountinfo:
> 
> # mount --make-rslave /               # Prevent our mount operations 
>                                       # propagating to other mountns
> # mkdir -p /mnt/freezer               # Create a new mount point
> # umount /sys/fs/cgroup/freezer       # Discard old mount
> # mount -t cgroup -o freezer freezer /mnt/freezer/
> # ./cgroup_info.sh
>         /proc/self/cgroup:      7:freezer:/
>         mountinfo:              /a/b    /mnt/freezer
> 
> The act of creating a new cgroup namespace caused the process's
> current freezer directory, "/a/b", to become its cgroup freezer root
> directory. In other words, the pathname directory of the directory
> within the newly mounted cgroup filesystem should be "/",
> but mountinfo wrongly shows us "/a/b". The consequence of this is
> that the process in the cgroup namespace cannot correctly construct
> the pathname of its cgroup root directory from the information in
> /proc/PID/mountinfo. (The current patch fixes exactly this problem.)
> 
> > With this patch, the dentry root field in mountinfo is shown relative
> > to the reader's cgroup namespace.  I.e.:
> > 
> >  unshare -Gm  bash /tmp/do1
> >  > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
> >  > 355 133 0:34 / /mnt rw,relatime - cgroup freezer rw,freezer
> > 
> > This way the task can correlate the paths in /proc/pid/cgroup to
> > /proc/self/mountinfo, and determine which cgroup directory (in any
> > mount which the reader created) corresponds to the task.
> 
> So, I applied your patch against a current (i.e., 4.6-rc6) kernel.
> Same steps as before, and here's what I see:
> 
> # mkdir -p /sys/fs/cgroup/freezer/a/b
> # echo $$ > /sys/fs/cgroup/freezer/a/b/cgroup.procs
> # ./cgroup_info.sh 
> 	/proc/self/cgroup:	10:freezer:/a/b
> 	mountinfo:		/	/sys/fs/cgroup/freezer
> # ~mtk/tlpi/code/ns/unshare -Cm bash
> # ./cgroup_info.sh 
> 	/proc/self/cgroup:	10:freezer:/
> 	mountinfo:		/../..	/sys/fs/cgroup/freezer
> # mount --make-rslave /
> # mkdir -p /mnt/freezer
> # umount /sys/fs/cgroup/freezer
> # mount -t cgroup -o freezer freezer /mnt/freezer/
> # ./cgroup_info.sh
> 	/proc/self/cgroup:	10:freezer:/
> 	mountinfo:		/	/mnt/freezer
> 
> Now the root directory path shown by mountinfo is correct,
> and when we look inside the mount point, we see that things
> look "right" (i.e., a cgroup root directory with no
> subdirectories, and the PID of the shell run by unshare is
> in the cgroup.procs file of this cgroup):
> 
> # ls /mnt/freezer/
> cgroup.clone_children  freezer.parent_freezing  freezer.state      tasks
> cgroup.procs           freezer.self_freezing    notify_on_release
> # echo $$
> 3164
> # cat /mnt/freezer/cgroup.procs
> 2653                   # First shell that placed in this cgroup
> 3164                   # Shell started by 'unshare'
> 14197                  # cat(1)
> 
> All makes sense to me.

Right.  So in particular, docker wants to do something like:

bindpath=`grep freezer /proc/self/mountinfo | tail -n 1 | awk '{ print $4 }'`
mountpoint=`grep freezer /proc/self/mountinfo | tail -n 1 | awk '{ print $5 }'`
mycg=`awk -F: '/freezer/ { print $3 }' /proc/self/cgroup`
cat ${mountpoint}/${bindpath}/${mycg}/cgroup.procs

and see its own task.

> Tested-by: Michael Kerrisk <mtk.manpages@xxxxxxxxx>
> Acked-by: Michael Kerrisk <mtk.manpages@xxxxxxxxx>
> 
> (I did no review of the patch itself though.)

Thanks, Michael.  I'll resend with corrections and a test script of
some sort.

> Cheers,
> 
> Michael
> 
> 
> > Signed-off-by: Serge Hallyn <serge.hallyn@xxxxxxxxxx>
> > ---
> >  fs/kernfs/mount.c      | 14 +++++++++++
> >  include/linux/kernfs.h |  2 ++
> >  kernel/cgroup.c        | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 79 insertions(+)
> > 
> > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> > index f73541f..3b78724 100644
> > --- a/fs/kernfs/mount.c
> > +++ b/fs/kernfs/mount.c
> > @@ -15,6 +15,7 @@
> >  #include <linux/slab.h>
> >  #include <linux/pagemap.h>
> >  #include <linux/namei.h>
> > +#include <linux/seq_file.h>
> >  
> >  #include "kernfs-internal.h"
> >  
> > @@ -40,6 +41,18 @@ static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry)
> >  	return 0;
> >  }
> >  
> > +static int kernfs_sop_show_path(struct seq_file *sf, struct dentry *dentry)
> > +{
> > +	struct kernfs_node *node = dentry->d_fsdata;
> > +	struct kernfs_root *root = kernfs_root(node);
> > +	struct kernfs_syscall_ops *scops = root->syscall_ops;
> > +
> > +	if (scops && scops->show_path)
> > +		return scops->show_path(sf, node, root);
> > +
> > +	return seq_dentry(sf, dentry, " \t\n\\");
> > +}
> > +
> >  const struct super_operations kernfs_sops = {
> >  	.statfs		= simple_statfs,
> >  	.drop_inode	= generic_delete_inode,
> > @@ -47,6 +60,7 @@ const struct super_operations kernfs_sops = {
> >  
> >  	.remount_fs	= kernfs_sop_remount_fs,
> >  	.show_options	= kernfs_sop_show_options,
> > +	.show_path	= kernfs_sop_show_path,
> >  };
> >  
> >  /**
> > diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> > index c06c442..30f089e 100644
> > --- a/include/linux/kernfs.h
> > +++ b/include/linux/kernfs.h
> > @@ -152,6 +152,8 @@ struct kernfs_syscall_ops {
> >  	int (*rmdir)(struct kernfs_node *kn);
> >  	int (*rename)(struct kernfs_node *kn, struct kernfs_node *new_parent,
> >  		      const char *new_name);
> > +	int (*show_path)(struct seq_file *sf, struct kernfs_node *kn,
> > +			 struct kernfs_root *root);
> >  };
> >  
> >  struct kernfs_root {
> > diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> > index 909a7d3..afea39e 100644
> > --- a/kernel/cgroup.c
> > +++ b/kernel/cgroup.c
> > @@ -1215,6 +1215,41 @@ static void cgroup_destroy_root(struct cgroup_root *root)
> >  	cgroup_free_root(root);
> >  }
> >  
> > +/*
> > + * look up cgroup associated with current task's cgroup namespace on the
> > + * specified hierarchy
> > + */
> > +static struct cgroup *
> > +current_cgns_cgroup_from_root(struct cgroup_root *root)
> > +{
> > +	struct cgroup *res = NULL;
> > +	struct css_set *cset;
> > +
> > +	lockdep_assert_held(&css_set_lock);
> > +
> > +	rcu_read_lock();
> > +
> > +	cset = current->nsproxy->cgroup_ns->root_cset;
> > +	if (cset == &init_css_set) {
> > +		res = &root->cgrp;
> > +	} else {
> > +		struct cgrp_cset_link *link;
> > +
> > +		list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
> > +			struct cgroup *c = link->cgrp;
> > +
> > +			if (c->root == root) {
> > +				res = c;
> > +				break;
> > +			}
> > +		}
> > +	}
> > +	rcu_read_unlock();
> > +
> > +	BUG_ON(!res);
> > +	return res;
> > +}
> > +
> >  /* look up cgroup associated with given css_set on the specified hierarchy */
> >  static struct cgroup *cset_cgroup_from_root(struct css_set *cset,
> >  					    struct cgroup_root *root)
> > @@ -1593,6 +1628,33 @@ static int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
> >  	return 0;
> >  }
> >  
> > +static int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
> > +			    struct kernfs_root *kf_root)
> > +{
> > +	int len = 0, ret = 0;
> > +	char *buf = NULL;
> > +	struct cgroup_root *kf_cgroot = cgroup_root_from_kf(kf_root);
> > +	struct cgroup *ns_cgroup;
> > +
> > +	buf = kmalloc(PATH_MAX, GFP_KERNEL);
> > +	if (!buf)
> > +		return -ENOMEM;
> > +
> > +	spin_lock_bh(&css_set_lock);
> > +	ns_cgroup = current_cgns_cgroup_from_root(kf_cgroot);
> > +	len = kernfs_path_from_node(kf_node, ns_cgroup->kn, buf, PATH_MAX);
> > +	spin_unlock_bh(&css_set_lock);
> > +
> > +	if (len >= PATH_MAX)
> > +		len = -ERANGE;
> > +	else if (len > 0) {
> > +		seq_escape(sf, buf, " \t\n\\");
> > +		len = 0;
> > +	}
> > +	kfree(buf);
> > +	return len;
> > +}
> > +
> >  static int cgroup_show_options(struct seq_file *seq,
> >  			       struct kernfs_root *kf_root)
> >  {
> > @@ -5433,6 +5495,7 @@ static struct kernfs_syscall_ops cgroup_kf_syscall_ops = {
> >  	.mkdir			= cgroup_mkdir,
> >  	.rmdir			= cgroup_rmdir,
> >  	.rename			= cgroup_rename,
> > +	.show_path		= cgroup_show_path,
> >  };
> >  
> >  static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
> > 
> 
> 
> -- 
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html