On Mon, Nov 3, 2014 at 2:43 PM, Aditya Kali <adityakali@xxxxxxxxxx> wrote: > > > On Fri, Oct 31, 2014 at 6:09 PM, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> > wrote: >> >> Aditya Kali <adityakali@xxxxxxxxxx> writes: >> >> > This patch enables cgroup mounting inside userns when a process >> > as appropriate privileges. The cgroup filesystem mounted is >> > rooted at the cgroupns-root. Thus, in a container-setup, only >> > the hierarchy under the cgroupns-root is exposed inside the container. >> > This allows container management tools to run inside the containers >> > without depending on any global state. >> > In order to support this, a new kernfs api is added to lookup the >> > dentry for the cgroupns-root. >> >> There is a misdesign in this. Because files already exist we need the >> protections that are present in proc and sysfs that only allow you to >> mount the filesystem if it is already mounted. Otherwise you can wind >> up mounting this cgroupfs in a chroot jail when the global root would >> not like you to see it. cgroupfs isn't as bad as proc and sys but there >> is at the very least an information leak in mounting it. >> > > I think simply mounting the cgroupfs doesn't give you any more information > than what you don't already know about the system ; specially if the > visibility is restricted within the process's cgroupns-root. The cgroups > still wont be writable by the user, so I think it should be fine to allow > mounting? > Can we try to figure out a better way to do this than checking at mount time for a fully-visible procfs/sysfs/cgroupfs? The current approach is unpleasant to deal with. For example, we could check the equivalent conditions when the userns is created and store then in a per-userns flags field. > >> >> Given that we are effectively performing a bind mount in this patch, and >> that we need to require cgroupfs be mounted anyway (to be safe). >> >> I don't see the point of this change. >> >> If we could change the set of cgroups or visible in cgroupfs I could >> probably see the point. But as it is this change seems to be pointless. >> > > I agree that this is effectively bind-mounting, but doing this in kernel > makes it really convenient for the userspace. The process that sets up the > container doesn't need to care whether it should bind-mount cgroupfs inside > the container or not. The tasks inside the container can mount cgroupfs on > as-needed basis. The root container manager can simply unshare cgroupns and > forget about the internal setup. I think this is useful just for the reason > that it makes life much simpler for userspace. > If we add the fully-visible check at mount time, then I almost agree with Eric. I say almost because fs_fully_visible isn't checking whether the superblock root is the thing that's mounted, and, if we fix that, then bind-mounting like this becomes impossible and we'd have to refine the check. But if we come up with something less gross than checking for fs visibility at mount time, then I agree with Aditya: let's let mount do the right thing, since there may be nothing there to bind mount. If we go that route, then I think we might want to make it explicit: require a mount flag like root=. to indicate that we want to be rooted at our cgroupns's root cgroup. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-api" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html