On Wed, 2016-05-04 at 16:26 +0200, Djalal Harouni wrote:
> This is version 2 of the VFS:userns support portable root filesystems
> RFC. Changes since version 1:
> 
> * Update documentation and remove some ambiguity about the feature.
>   Based on Josh Triplett comments.
> * Use a new email address to send the RFC :-)
> 
> 
> This RFC tries to explore how to support filesystem operations inside
> user namespace using only VFS and a per mount namespace solution.
> This allows to take advantage of user namespace separations without
> introducing any change at the filesystems level. All this is handled
> with the virtual view of mount namespaces.
> 
> 
> 1) Presentation:
> ================
> 
> The main aim is to support portable root filesystems and allow
> containers, virtual machines and other cases to use the same root
> filesystem. Due to security reasons, filesystems can't be mounted
> inside user namespaces, and mounting them outside will not solve the
> problem since they will show up with the wrong UIDs/GIDs. Read and
> write operations will also fail and so on.
> 
> The current userspace solution is to automatically chown the whole
> root filesystem before starting a container, example:
> (host) init_user_ns 1000000:1065536 => (container) user_ns_X1 0:65535
> (host) init_user_ns 2000000:2065536 => (container) user_ns_Y1 0:65535
> (host) init_user_ns 3000000:3065536 => (container) user_ns_Z1 0:65535
> ...
> 
> Every time a chown is called, files are changed and so on... This
> prevents to have portable filesystems where you can throw anywhere
> and boot. Having an extra step to adapt the filesystem to the current
> mapping and persist it will not allow to verify its integrity, it
> makes snapshots and migration a bit harder, and probably other
> limitations...
> 
> It seems that there are multiple ways to allow user namespaces
> combine nicely with filesystems, but none of them is that easy.
> The bind mount and pin the user namespace during mount time will not
> work, bind mounts share the same super block, hence you may end up
> working on the wrong vfsmount context and there is no easy way to get
> out of that...

So this option was discussed at the recent LSF/MM summit. The most
supported suggestion was that you'd use a new internal fs type that had
a struct mount with a new superblock and would copy the underlying
inodes but substitute its own with modified ->getattr/->setattr calls
that did the uid shift. In many ways it would be a remapping bind which
would look similar to overlayfs but be a lot simpler.

> Using the user namespace in the super block seems the way to go, and
> there is the "Support fuse mounts in user namespaces" [1] patches
> which seem nice but perhaps too complex!?

So I don't think that does what you want. The fuse project I've used
before to do uid/gid shifts for build containers is bindfs

https://github.com/mpartel/bindfs/

It allows a --map argument where you specify pairs of uids/gids to map
(tedious for large ranges, but the map can be fixed to use uid:range
instead of individual pairs).

> there is also the overlayfs solution, and finally the VFS layer
> solution.
> 
> We present here a simple VFS solution, everything is packed inside
> VFS, filesystems don't need to know anything (except probably XFS,
> and special operations inside union filesystems). Currently it
> supports ext4, btrfs and overlayfs. Changes into filesystems are
> small, just parse the vfs_shift_uids and vfs_shift_gids options
> during mount and set the appropriate flags into the super_block
> structure.

So this looks a little daunting. It sprays the VFS with knowledge about
the shifts and requires support from every underlying filesystem. A
simple remapping bind filesystem would be a lot simpler and require no
underlying filesystem support.
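
To make the uid shift concrete, here is a minimal sketch of the
arithmetic a remapping layer might apply in its getattr and setattr
paths. It is only an illustration, not kernel code: IdShift, to_host and
to_disk are hypothetical names, and the range (base 1000000, 65536 IDs)
is assumed from the first mapping in the quoted example.

```python
from typing import NamedTuple, Optional


class IdShift(NamedTuple):
    """Hypothetical description of one contiguous uid/gid shift."""
    base: int   # first host-side ID of the range (assumed: 1000000)
    count: int  # number of IDs in the range (assumed: 65536)


def to_host(shift: IdShift, disk_id: int) -> Optional[int]:
    """getattr direction: unshifted on-disk ID -> host-side ID.

    Returns None when the on-disk ID falls outside the shifted range.
    """
    if 0 <= disk_id < shift.count:
        return shift.base + disk_id
    return None


def to_disk(shift: IdShift, host_id: int) -> Optional[int]:
    """setattr direction: host-side ID -> unshifted on-disk ID.

    Returns None when the host-side ID is not covered by the range.
    """
    offset = host_id - shift.base
    if 0 <= offset < shift.count:
        return offset
    return None


# Assumed range for the container called user_ns_X1 in the quoted example.
X1 = IdShift(base=1_000_000, count=65_536)
```

With a shift like this the files keep IDs 0..65535 on disk, so the same
image could be reused under a different base without a chown pass. How
IDs outside the range should be reported is left open here; the sketch
simply returns None for them.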

James