On Wed, Nov 27, 2013 at 8:26 AM, Serge E. Hallyn <serge@xxxxxxxxxx> wrote:
> Quoting Andy Lutomirski (luto@xxxxxxxxxxxxxx):
>> On Wed, Nov 27, 2013 at 6:44 AM, Serge E. Hallyn <serge@xxxxxxxxxx> wrote:
>> > Quoting Eric W. Biederman (ebiederm@xxxxxxxxxxxx):
>> >> "Serge E. Hallyn" <serge@xxxxxxxxxx> writes:
>> >>
>> >> > Quoting Andy Lutomirski (luto@xxxxxxxxxxxxxx):
>> >> >> IIUC there are multiple ways to end up with a socket pair for which
>> >> >> one end is in a user namespace and the other is outside of it. That
>> >> >> means that SCM_CREDENTIALS can be used by a process in a userns to
>> >> >> authenticate to a process outside.
>> >> >>
>> >> >> This is all well and good (and, as far as I know, correct), but I'm
>> >> >
>> >> > And the cgroup manager I'm starting on depends on this.
>> >> >
>> >> >> not sure this is always the desired behavior. In the context of a
>> >> >> tool like Docker, it might be useful to have several user namespaces
>> >> >> that have the *same* uids mapped. Nonetheless, if one of those
>> >> >> namespaces is compromised, it probably shouldn't be permitted to
>> >> >> attack things outside the user namespace (or in the host, if any
>> >> >> interesting uids are mapped).
>> >> >>
>> >> >> Would it make sense to have an option to allow a user namespace to opt
>> >> >> into different behavior so that its users show up as the invalid uid
>> >> >> as seen from outside (at least for SCM_CREDENTIALS and SO_PEERCRED)?
>> >> >>
>> >> >> Implementing this might be awkward (ok, it might actively suck due to
>> >> >> a possible need for reference counting), but I'm wondering if it's a
>> >> >> good idea even in principle.
>> >> >
>> >> > Well, I'll grant you, if I have a single directory with a socket in
>> >> > it, and I make that the aufs or overlayfs underlay for two separate
>> >> > mounts, which each are in different containers, then you might have
>> >> > a problem here.
>> >> >
>> >> > Now maybe the answer to that is that the sockets should be created
>> >> > in tmpfss (/run, /tmp, etc) anyway. But the more I think about it
>> >> > the more I, unfortunately, agree that this could be a problem.
>> >>
>> >> I really hate the concept of mapping a uid in some contexts and not
>> >> others. That seems very prone to go wrong. Given all of the possible
>> >> kinds of permutations I can't imagine how we would get it correct.
>> >>
>> >> MS_NOSUID and MS_RDONLY will help with some of the worst offenders.
>> >> But it will still be possible for the user namespace root to call
>> >> setuid(NNN); and create a process with that uid. And if a unix domain
>> >> socket isn't the only means of interacting there will still be problems.
>> >>
>> >> I will suggest that writing a uid mapping filesystem like overlayfs or
>> >> perhaps as a mount option of overlayfs is likely to be a more robust
>> >> solution in general. Certainly that is what I originally had on the
>> >> drawing board to solve this class of problem.
>> >
>> > Actually an option to aufs and overlayfs to say "any unix domain socket
>> > which is opened must first be copied to the writeable layer" would
>> > solve the issue (at least for all reasonable cases, iiuc)
>>
>> I guess I'm reasonably convinced that overlayfs is the right place to
>> fix this. (Containers using lvm will be left in the cold -- oh,
>> well.)
>
> Have you tested that? If I create two LVM snapshots of an LVM, with a
> unix sock on the original, and run containers on both snapshots, does
> the socket connect the two containers?

That won't work, of course. I meant that lvm containers won't be able
to remap filesystem uids, which would be an even better fix.

--Andy
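
For reference, here is a minimal sketch of the SO_PEERCRED lookup the thread keeps coming back to. It is not code from any of the posters: the helper name peer_credentials is made up for illustration, and the sketch assumes Linux and a connected AF_UNIX socket. It only shows what the receiving side reads. As far as I understand it, the kernel translates the reported uid into the reader's user namespace, which is why a peer inside a userns with a mapped uid can look like an ordinary host user.

```python
import os
import socket
import struct

# Layout of Linux's struct ucred: pid_t pid, uid_t uid, gid_t gid.
UCRED_FORMAT = "iII"


def peer_credentials(sock):
    """Return (pid, uid, gid) of the peer of a connected AF_UNIX socket.

    Hypothetical helper for illustration; Linux only (uses SO_PEERCRED).
    """
    raw = sock.getsockopt(
        socket.SOL_SOCKET, socket.SO_PEERCRED, struct.calcsize(UCRED_FORMAT)
    )
    return struct.unpack(UCRED_FORMAT, raw)


if __name__ == "__main__":
    # Both ends of a socketpair are created by this process, so the
    # credentials reported for the peer are this process's own.
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        pid, uid, gid = peer_credentials(left)
        print("peer pid=%d uid=%d gid=%d" % (pid, uid, gid))
    finally:
        left.close()
        right.close()
```

A service that makes authorization decisions from the uid returned here is trusting exactly the mapping being debated above.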