On Tue, Oct 21, 2014 at 5:46 PM, Aditya Kali <adityakali@xxxxxxxxxx> wrote: > On Tue, Oct 21, 2014 at 3:42 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote: >> On Tue, Oct 21, 2014 at 3:33 PM, Aditya Kali <adityakali@xxxxxxxxxx> wrote: >>> On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote: >>>> On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <adityakali@xxxxxxxxxx> wrote: >>>>> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote: >>>>>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman >>>>>> <ebiederm@xxxxxxxxxxxx> wrote: >>>>>>> >>>>>>> I do wonder if we think of this as chcgrouproot if there is a simpler >>>>>>> implementation. >>>>>> >>>>>> Could be. I'll defer to Aditya for that one. >>>>>> >>>>> >>>>> More than chcgrouproot, its probably closer to pivot_cgroup_root. In >>>>> addition to restricting the process to a cgroup-root, new processes >>>>> entering the container should also be implicitly contained within the >>>>> cgroup-root of that container. >>>> >>>> Why? Concretely, why should this be in the kernel namespace code >>>> instead of in userspace? >>>> >>> >>> Userspace can do it too. Though then there will be possibility of >>> having processes in the same mount namespace with different >>> cgroup-roots. Deriving contents of /proc/<pid>/cgroup becomes even >>> more complex. Thats another reason why it might not be good idea to >>> tie cgroups with mount namespace. >>> >>>>> Implementing pivot_cgroup_root would >>>>> probably involve overloading mount-namespace to now understand cgroup >>>>> filesystem too. I did attempt combining cgroupns-root with mntns >>>>> earlier (not via a new syscall though), but came to the conclusion >>>>> that its just simpler to have a separate cgroup namespace and get >>>>> clear semantics. One of the issues was that implicitly changing cgroup >>>>> on setns to mntns seemed like a huge undesirable side-effect. >>>>> >>>>> About pinning: I really feel that it should be OK to pin processes >>>>> within cgroupns-root. I think thats one of the most important feature >>>>> of cgroup-namespace since its most common usecase is to containerize >>>>> un-trusted processes - processes that, for their entire lifetime, need >>>>> to remain inside their container. >>>> >>>> So don't let them out. None of the other namespaces have this kind of >>>> constraint: >>>> >>>> - If you're in a mntns, you can still use fds from outside. >>>> - If you're in a netns, you can still use sockets from outside the namespace. >>>> - If you're in an ipcns, you can still use ipc handles from outside. >>> >>> But none of the namespaces allow you to allocate new fds/sockets/ipc >>> handles in the outside namespace. I think moving a process outside of >>> cgroupns-root is like allocating a resource outside of your namespace. >> >> In a pidns, you can see outside tasks if you have an outside procfs >> mounted, but, if you don't, then you can't. Wouldn't cgroupns be just >> like that? You wouldn't be able to escape your cgroup as long as you >> don't have an inappropriate cgroupfs mounted. >> > > I am not if we should only depend on restricted visibility for this > though. More details below. > >> >>>> >>>>> And with explicit permission from >>>>> cgroup subsystem (something like cgroup.may_unshare as you had >>>>> suggested previously), we can make sure that unprivileged processes >>>>> cannot pin themselves. Also, maintaining this invariant (your current >>>>> cgroup is always under your cgroupns-root) keeps the code and the >>>>> semantics simple. >>>> >>>> I actually think it makes the semantics more complex. The less policy >>>> you stick in the kernel, the easier it is to understand the impact of >>>> that policy. >>>> >>> >>> My inclination is towards keeping things simpler - both in code as >>> well as in configuration. I agree that cgroupns might seem >>> "less-flexible", but in its current form, it encourages consistent >>> container configuration. If you have a process that needs to move >>> around between cgroups belonging to different containers, then that >>> process should probably not be inside any container's cgroup >>> namespace. Allowing that will just make the cgroup namespace >>> pretty-much meaningless. >> >> The problem with pinning is that preventing it causes problems >> (specifically, either something potentially complex and incompatible >> needs to be added or unprivileged processes will be able to pin >> themselves). >> >> Unless I'm missing something, a normal cgroupns user doesn't actually >> need kernel pinning support to effectively constrain its members' >> cgroups. >> > > So there are 2 scenarios to consider: > > We have 2 containers with cgroups: /container1 and /container2 > Assume process P is running under cgroupns-root '/container1' > > (1) process P wants to 'write' to cgroup.procs outside its > cgroupns-root (say to /container2/cgroup.procs) This, at least, doesn't have the problem with unprivileged processes pinning themselves. > (2) An admin process running in init_cgroup_ns (or any parent cgroupns > with cgroupns-root above /container1) wants to write pid of process P > to /container2/cgroup.procs (which lies outside of P's cgroupns-root) > > For (1), I think its ok to reject such a write. This is consistent > with the restriction in cgroup_file_write added in 'Patch 6' of this > set. I believe this should be independent of visibility of the cgroup > hierarchy for P. > > For (2), we may allow the write to succeed if we make sure that the > process doing the write is an admin process (with CAP_SYS_ADMIN in its > userns AND over P's cgroupns->user_ns). Why is its userns relevant? Why not just check whether the target cgroup is in the process doing the write's cgroupns? (NB: you need to check f_cred, here, not current_cred(), but that's orthogonal.) Then the policy becomes: no user of cgroupfs can move any process outside of the cgroupfs's user's cgroupns root. I think I'm okay with this. > If this write succeeds, then: > (a) process P's /proc/<pid>/cgroup does not show anything when viewed > by 'self' or any other process in P's cgrgroupns. I would really like > to avoid showing relative paths or paths outside the cgroupns-root The empty string seems just as problematic to me. > (b) if process P does 'mount -t cgroup cgroup <mnt>', it will still be > only able to mount and see cgroup hierarchy under its cgroupns-root > (d) if process P tries to write to any cgroup file outside of its > cgroupns-root (assuming that hierarchy is visible to it for whatever > reason), it will fail as in (1) I'm still unconvinced that this serves any purpose. If you give DAC/MAC permission to a task to write to something, and you give it access to an fd or mount pointing there, and you don't want it writing there, then *don't do that*. I'm not really seeing why cgroupns needs special treatment here. > > i.e., in summary, you can't escape out of cgroupns-root yourself. You > will need help from an admin process running under some parent > cgroupns-root to move you out. Is that workable for your usecase? Most > of the things above already happen with the current patch-set, so it > should be easy to enable this. > > Though there are still some open issues like: > * what happens if you move all the processes out of /container1 and > then 'rmdir /container1'? As it is now, you won't be able to setns() > to that cgroupns anymore. But the cgroupns will still hang around > until the processes switch their cgroupns. Seems okay. > * should we then also allow setns() without first entering the > cgroupns-root? setns also checks the same conditions as in (a) plus it > checks that your current cgroup is descendant of target cgroupns-root. > Alternatively we can special-case setns() to own cgroupns so that it > doesn't fail. I think setns should completely ignore the caller's cgroup and should not change it. Userspace can do this. > * migration for these processes will be tricky, if not impossible. But > the admin trying to do this probably doesn't care about it or will > provision for it. Migration for processes in a mntns that have a current directory outside their mntns is also difficult or impossible. Same with pidnses with an fd pointing at /proc/self from an outside-the-pid-ns procfs. Nothing new here. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-api" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html