Hi,

On Wed, May 04, 2016 at 11:30:09PM +0000, Serge Hallyn wrote:
> Quoting Djalal Harouni (tixxdz@xxxxxxxxx):
> > This is version 2 of the VFS:userns support portable root filesystems
> > RFC. Changes since version 1:
> >
> > * Update documentation and remove some ambiguity about the feature.
> >   Based on Josh Triplett comments.
> > * Use a new email address to send the RFC :-)
> >
> >
> > This RFC tries to explore how to support filesystem operations inside
> > user namespace using only VFS and a per mount namespace solution. This
> > allows to take advantage of user namespace separations without
> > introducing any change at the filesystems level. All this is handled
> > with the virtual view of mount namespaces.
>
> Given your use case, is there any way we could work in some tradeoffs
> to protect the host? What I'm thinking is that containers can all
> share devices uid-mapped at will, however any device mounted with
> uid shifting cannot be used by the initial user namespace. Or maybe
> just non-executable in that case, as you'll need enough access to
> the fs to set up the containers you want to run.
>
> So if /dev/sda1 is your host /, you have to use /dev/sda2 as the
> container rootfs source. Mount it under /containers with uid
> shifting. Now all containers regardless of uid mappings see
> the shifted fs contents. But the host root cannot be tricked by
> files on it, as /dev/sda2 is non-executable as far as it is
> concerned.

Of course the whole setup relies on the container manager to set up the
right mount namespace, clean mounts, etc., then pivot root, boot or
whatever...

Now I guess we can achieve what you want with MS_SLAVE|MS_REC on / ?

You create a new mount/pid...
namespaces with shift flags, but you are still in init_user_ns. You
remount your / with MS_SLAVE|MS_REC, then you create new mount/pid
namespaces with the shift flag (two mount namespaces here if you don't
want to race between setting the MS_SLAVE flag and creating the mount
namespace, and you don't trust other processes... or you want the same
nested setup...).

This second, secure mount namespace is the one you will use to set up
the container: device nodes, loops... the filesystems that you want
inside the container (probably with shift options), and also filesystems
that you can't mount inside user namespaces or don't want to show up in
or propagate to the host. You may also want to umount things or remount
them to change mount options, etc. Anyway, let's call this the cleaning
of the mount namespace.

Now during this phase, when you mount and prepare these filesystems,
mount them with the noexec flag first and remount them later with exec,
or delay the mounting until just before you do a new
clone(CLONE_NEWUSER...). During this phase the container manager should
get the device that you want to share from input or an argument; it will
only mount and prepare it inside the new mount namespaces or containers,
and make sure that it is never propagated back...

After clone(CLONE_NEWUSER|CLONE_NEWNS|CLONE_MNTNS_SHIFT_UIDGID), set up
the user namespace mapping; I guess you then drop capabilities, do
setuid() or whatever, and start PID 1 or the app of the container.

Now, so as not to confuse Dave any further, since he doesn't like the
idea of a shared backing device, and neither do I, for obvious reasons:
the shared device should not be used for a rootfs. Maybe for read-only
shared user data, or shared config, that's it... For a real rootfs, each
container should have its own *different* backing device! Unless you
know what you are doing, hehe. I don't want to confuse people, and I
just lack time; I will also respond to Dave's email.

> Just a thought.

Do you think this will solve the case?

Thanks for your comments!
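
P.S. To make the ordering above easier to follow, here is a rough dry-run
sketch in Python. It performs no syscalls and only lists the steps in
sequence. The function name plan_container_setup and the step labels are
hypothetical, /dev/sda2 and /containers are taken from Serge's example,
and CLONE_MNTNS_SHIFT_UIDGID is the flag proposed by this RFC, so it
appears only as a name. Placing the exec remount right before the final
clone() is my assumption; the text only says "later".

```python
# Dry-run sketch: returns the ordered steps, performs no syscalls.
# CLONE_MNTNS_SHIFT_UIDGID is proposed by the RFC and is not a mainline
# kernel flag, so it is kept as a symbolic name only.
SHIFT = "CLONE_MNTNS_SHIFT_UIDGID"


def plan_container_setup(device, target):
    """Return (action, detail) pairs in the order described in the mail."""
    return [
        # Still in init_user_ns: first mount/pid namespaces.
        ("new_ns", f"CLONE_NEWNS|CLONE_NEWPID|{SHIFT}"),
        # Stop mounts made below from propagating back to the host.
        ("mount", "/ MS_SLAVE|MS_REC"),
        # Second ("secure") mount namespace, used to prepare the container.
        ("new_ns", f"CLONE_NEWNS|CLONE_NEWPID|{SHIFT}"),
        # Cleaning phase: mount the shared device noexec first.
        ("mount", f"{device} {target} MS_NOEXEC"),
        # Assumption: drop noexec only right before entering the userns.
        ("mount", f"{target} MS_REMOUNT"),
        ("clone", f"CLONE_NEWUSER|CLONE_NEWNS|{SHIFT}"),
        ("idmap", "write uid_map and gid_map"),
        ("drop", "capabilities, setuid()"),
        ("exec", "PID 1 or the container app"),
    ]


if __name__ == "__main__":
    steps = plan_container_setup("/dev/sda2", "/containers")
    for number, (action, detail) in enumerate(steps, 1):
        print(f"{number}. {action}: {detail}")
```

A real container manager would issue the corresponding unshare()/clone()
and mount() calls with proper error handling; the list is only meant to
pin down the order of operations.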
-- 
Djalal Harouni
http://opendz.org