On Mon, Oct 13, 2014 at 2:23 PM, Aditya Kali <adityakali@xxxxxxxxxx> wrote:
> Second take at the Cgroup Namespace patch-set.
>
> Major changes from RFC (V0):
> 1. setns support for cgroupns
> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
>    mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
> 3. writes to cgroup files outside of cgroupns-root are not allowed
> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
>    anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
>    your cgroupns-root.
>
> More details in the writeup below.
>
> Background
>   Cgroups and Namespaces are used together to create “virtual”
>   containers that isolate the host environment from the processes
>   running in the container. But since cgroups themselves are not
>   “virtualized”, the task is always able to see the global cgroup view
>   through the cgroupfs mount and via the /proc/self/cgroup file.
>
>   $ cat /proc/self/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>
>   This exposure of cgroup names to the processes running inside a
>   container results in some problems:
>   (1) The container names are typically host-container-management-agent
>       (systemd, docker/libcontainer, etc.) data, and leaking a name (or
>       leaking the hierarchy) reveals too much information about the host
>       system.
>   (2) It makes container migration across machines (CRIU) more
>       difficult, as the container names need to be unique across the
>       machines in the migration domain.
>   (3) It makes it difficult to run container management tools (like
>       docker/libcontainer, lmctfy, etc.) within virtual containers
>       without adding a dependency on some state/agent present outside the
>       container.
>
>   Note that the feature proposed here is completely different from the
>   “ns cgroup” feature which existed in the Linux kernel until recently.
>   The ns cgroup also attempted to connect cgroups and namespaces by
>   creating a new cgroup every time a new namespace was created. It did
>   not solve any of the above-mentioned problems and was later dropped
>   from the kernel. Incidentally though, it used the same config option
>   name CONFIG_CGROUP_NS as used in my prototype!
>
> Introducing CGroup Namespaces
>   With unified cgroup hierarchy
>   (Documentation/cgroups/unified-hierarchy.txt), the containers can now
>   have a much more coherent cgroup view and it's easy to associate a
>   container with a single cgroup. This also allows us to virtualize the
>   cgroup view for tasks inside the container.
>
>   The new CGroup Namespace allows a process to “unshare” its cgroup
>   hierarchy starting from the cgroup it is currently in.
>   For example:
>   $ cat /proc/self/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>   $ ls -l /proc/self/ns/cgroup
>   lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>   $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
>   [ns]$ ls -l /proc/self/ns/cgroup
>   lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>   # From within new cgroupns, process sees that it's in the root cgroup
>   [ns]$ cat /proc/self/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>
>   # From global cgroupns:
>   $ cat /proc/<pid>/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>
>   # Unshare cgroupns along with userns and mountns
>   # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>   # sets up uid/gid map and exec’s /bin/bash
>   $ ~/unshare -c -u -m
>
>   # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
>   # hierarchy.
>   [ns]$ mount -t cgroup cgroup /tmp/cgroup
>   [ns]$ ls -l /tmp/cgroup
>   total 0
>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>   -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>   -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>
>   The cgroupns-root (/batchjobs/c_job_id1 in the above example) becomes the
>   filesystem root for the namespace-specific cgroupfs mount.
>
>   The virtualization of the /proc/self/cgroup file, combined with restricting
>   the view of the cgroup hierarchy by a namespace-private cgroupfs mount,
>   should provide a completely isolated cgroup view inside the container.
>
>   In its current form, the cgroup namespaces patchset provides the following
>   behavior:
>
>   (1) The “root” cgroup for a cgroup namespace is the cgroup in which
>       the process calling unshare is running.
>       For example, if a process in /batchjobs/c_job_id1 cgroup calls unshare,
>       cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
>       For the init_cgroup_ns, this is the real root (“/”) cgroup
>       (identified in code as cgrp_dfl_root.cgrp).
>
>   (2) The cgroupns-root cgroup does not change even if the namespace
>       creator process later moves to a different cgroup.
>       $ ~/unshare -c  # unshare cgroupns in some cgroup
>       [ns]$ cat /proc/self/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>       [ns]$ mkdir sub_cgrp_1
>       [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>       [ns]$ cat /proc/self/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>
>   (3) Each process gets its cgroupns-specific view of
>       /proc/<pid>/cgroup.
>   (a) Processes running inside the cgroup namespace will be able to see
>       cgroup paths (in /proc/self/cgroup) only inside their root cgroup
>       [ns]$ sleep 100000 &  # From within unshared cgroupns
>       [1] 7353
>       [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>       [ns]$ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>
>   (b) From global cgroupns, the real cgroup path will be visible:
>       $ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1

This is a little weird.  Not sure it's a problem.

>
>   (c) From a sibling cgroupns (cgroupns root-ed at a sibling cgroup), no cgroup
>       path will be visible:
>       # ns2's cgroupns-root is at '/batchjobs/c_job_id2'
>       [ns2]$ cat /proc/7353/cgroup
>       [ns2]$
>       This is the same as when the cgroup hierarchy is not mounted at all.
>       (In a correct container setup though, it should not be possible to
>       access PIDs in another container in the first place.)
>
>   (4) Processes inside a cgroupns are not allowed to move out of the
>       cgroupns-root. This is true even if a privileged process in global
>       cgroupns tries to move the process out of its cgroupns-root.
>
>       # From global cgroupns
>       $ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>       # cgroupns-root for 7353 is /batchjobs/c_job_id1
>       $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
>       -bash: echo: write error: Operation not permitted
>
>
>   (6) When some thread from a multi-threaded process unshares its
>       cgroup-namespace, the new cgroupns gets applied to the entire
>       process (all the threads). This should be OK since
>       unified-hierarchy only allows process-level containerization. So
>       all the threads in the process will have the same cgroup. And both
>       - changing cgroups and unsharing namespaces - are protected under
>       threadgroup_lock(task).

This seems odd to me.  Does unsharing the cgroupns unshare for all
tasks in the process?
If not, then I think that it shouldn't change the cgroup either.

What did you end up doing to grant permission to unshare the cgroup ns?

--Andy
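
For reference, the visibility rules in (3) and the move restriction in (4) of the quoted writeup can be modeled in a few lines of Python. This is only a rough sketch of the behavior as described there, not code from the patch-set: the function names (is_within, visible_cgroup_path, move_allowed) are made up for illustration, and the kernel obviously does this on cgroup structures rather than on path strings.

```python
def is_within(path, root):
    """Return True if cgroup `path` is `root` itself or a descendant of it."""
    if root == "/":
        return True
    return path == root or path.startswith(root + "/")


def visible_cgroup_path(real_path, reader_ns_root):
    """Model of what /proc/<pid>/cgroup shows to a reader.

    real_path:      the task's cgroup path as seen from the global cgroupns
    reader_ns_root: the cgroupns-root of the process doing the read

    Returns the path relative to the reader's cgroupns-root, or None when
    the task's cgroup falls outside it (the sibling-cgroupns case, (3)(c)).
    """
    if not is_within(real_path, reader_ns_root):
        return None
    if reader_ns_root == "/":
        return real_path
    relative = real_path[len(reader_ns_root):]
    return relative or "/"


def move_allowed(target_path, task_ns_root):
    """Model of rule (4): a task cannot be moved outside its cgroupns-root."""
    return is_within(target_path, task_ns_root)
```

With the paths from the examples above, visible_cgroup_path("/batchjobs/c_job_id1/sub_cgrp_1", "/batchjobs/c_job_id1") gives "/sub_cgrp_1", the same call with a reader rooted at "/batchjobs/c_job_id2" gives None, and move_allowed("/batchjobs/c_job_id2", "/batchjobs/c_job_id1") is False, matching the "Operation not permitted" case.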