This is version 2 of the "VFS:userns: support portable root filesystems"
RFC.

Changes since version 1:
* Update documentation and remove some ambiguity about the feature,
  based on Josh Triplett's comments.
* Use a new email address to send the RFC :-)

This RFC explores how to support filesystem operations inside user
namespaces using only VFS and a per mount namespace solution. This makes
it possible to take advantage of user namespace separation without
introducing any change at the filesystem level. All of this is handled
through the virtual view of mount namespaces.

1) Presentation:
================

The main aim is to support portable root filesystems and allow containers,
virtual machines and other cases to use the same root filesystem.
For security reasons, filesystems can't be mounted inside user namespaces,
and mounting them outside will not solve the problem since they will show
up with the wrong UIDs/GIDs. Read and write operations will also fail, and
so on.

The current userspace solution is to automatically chown the whole root
filesystem before starting a container, for example:

(host) init_user_ns 1000000:1065536 => (container) user_ns_X1 0:65535
(host) init_user_ns 2000000:2065536 => (container) user_ns_Y1 0:65535
(host) init_user_ns 3000000:3065536 => (container) user_ns_Z1 0:65535
...

Every time chown is called, files are changed, and so on. This prevents
having portable filesystems that you can drop anywhere and boot. Having an
extra step that adapts the filesystem to the current mapping and persists
it does not allow verifying its integrity, makes snapshots and migration a
bit harder, and probably has other limitations...

It seems that there are multiple ways to make user namespaces combine
nicely with filesystems, but none of them is that easy. Bind mounting and
pinning the user namespace at mount time will not work: bind mounts share
the same super block, hence you may end up working on the wrong vfsmount
context and there is no easy way to get out of that...
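
To make the host/container ranges listed above concrete, here is a minimal
shell sketch of the translation that a single /proc/<pid>/uid_map line
("<first ID inside> <first ID outside> <count>") describes. The helper
names map_to_host and map_to_ns are hypothetical and are not part of this
series; the sketch handles one extent only and assumes the default
overflow UID 65534 for IDs that have no mapping.

```shell
#!/bin/sh
# Illustrative only: map_to_host and map_to_ns are hypothetical helpers
# that model one uid_map extent "<ns_first> <host_first> <count>".

# Translate an ID seen inside the user namespace to the host (parent) ID.
map_to_host() {
    ns_first=$1; host_first=$2; count=$3; id=$4
    if [ "$id" -ge "$ns_first" ] && [ "$id" -lt $(( $ns_first + $count )) ]; then
        echo $(( $id - $ns_first + $host_first ))
    else
        echo "unmapped"
    fi
}

# Translate a host ID to the ID seen inside the user namespace.
# IDs outside the extent show up as the overflow UID (assumed 65534).
map_to_ns() {
    ns_first=$1; host_first=$2; count=$3; id=$4
    if [ "$id" -ge "$host_first" ] && [ "$id" -lt $(( $host_first + $count )) ]; then
        echo $(( $id - $host_first + $ns_first ))
    else
        echo 65534
    fi
}

map_to_host 0 1000000 65536 0      # container root as seen from the host
map_to_ns   0 1000000 65536 1000   # host UID 1000 as seen from inside
```

With the first mapping listed above, a chown-based setup has to rewrite
the owner of every inode from N to the value map_to_host returns for N
before the container starts, and redo it whenever the mapping changes.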

Using the user namespace in the super block seems the way to go, and there
are the "Support fuse mounts in user namespaces" [1] patches, which seem
nice but perhaps too complex!? There is also the overlayfs solution, and
finally the VFS layer solution.

We present here a simple VFS solution: everything is packed inside VFS, and
filesystems don't need to know anything (except probably XFS, and special
operations inside union filesystems). Currently it supports ext4, btrfs and
overlayfs. Changes to filesystems are small: just parse the vfs_shift_uids
and vfs_shift_gids options during mount and set the appropriate flags in
the super_block structure.

1) Filesystems don't need the FS_USERNS_MOUNT flag, so there is no user
namespace mounting; they stay secure and nothing changes.

2) The solution is based on VFS and mount namespaces. We use the user
namespace of the containing mount namespace to check if we should shift
UIDs/GIDs from/to virtual <=> on-disk view.
If a filesystem was mounted with the "vfs_shift_uids" and "vfs_shift_gids"
options, and if it shows up inside a mount namespace that supports VFS
UID/GID shifts, then during each access we remap the UID/GID either to the
virtual or to the on-disk view using simple helper functions to allow the
access. If the mount or the current mount namespace does not support VFS
UID/GID shifts, we fall back to the old behaviour and no shift is
performed.

3) The existing user namespace interface is the one used to do the
translation from virtual to on-disk mapping.

4) inodes will always keep their original values, which reflect the mapping
inside init_user_ns, which we consider the on-disk mapping.

4.1) During access we map to the virtual view, and if inode->{i_uid|i_gid}
do not have a mapping in the mount namespace we construct one for them.

4.2) For on-disk writes we construct the appropriate kuid/kgid that should
be stored on-disk.
If they have a mapping in the mount namespace, we use the corresponding
uid_t/gid_t values of that mapping inside the mount namespace and construct
the kuid from the pair init_user_ns and uid_t. This covers the cases where
the mapping inside should be the one stored on-disk. If they don't have a
mapping in the mount namespace, we fall back to the old behaviour: the
global kuid inside init_user_ns is the one used to update inode->i_uid.

As an example, if the mapping inside the mount namespace is 0:65535 and
outside it is 1000000:1065536, then 0:65535 will be the range that we use
to construct the UID/GID mapping into init_user_ns and use for on-disk
data. These are the persistent values that we want to write to the disk.
Therefore, we don't keep track of any UID/GID shift that was applied
before. This gives portability and allows the previous mapping, once
freed, to be reused for another root filesystem...

If the mapping inside the mount namespace is 1000:65535 and outside is
2000:65535, then the range used to construct the UID/GID mapping that
updates inode->{i_uid|i_gid} will be the one inside the container. We
always use that one to construct the kuid/kgid from uid_t/gid_t and
init_user_ns.

$ cat /proc/self/uid_map
1000 2000 65536
$ stat -c '%u:%g' mountpoint/etc/fedora-release
65534:65534
$ stat -c '%u:%g' mountpoint/home/tixxdz/
1000:1000
$ touch mountpoint/newuser
touch: cannot touch ‘mountpoint/newuser’: Permission denied
$ stat -c '%u:%g' mountpoint/home/tixxdz/newuser
1000:1000

[ outside of namespaces] $ stat -c '%u:%g' mountpoint/home/tixxdz/newuser
1000:1000

Please note that the range here is not hardcoded to 65535; it can be any
value set by the creator of the user namespace. These patches use the only
interface user namespaces provide. 2**16 was used here just to show how
filesystems can be made portable by making the most used UIDs/GIDs
available inside containers.

Simple demo: overlayfs and btrfs mounted with vfs_shift_uids and
vfs_shift_gids.
The overlayfs mounts will share the same upperdir. We create two user
namespaces, each with its own mapping, where container-uid-2000000 will
automatically pick up changes from the container-uid-1000000 upperdir.

[tixxdz@fedora-kvm btrfs_root]$ mount | grep btrfs
/dev/mapper/fedora-btrfs_root on /mnt/btrfs_root type btrfs (rw,relatime,seclabel,space_cache,vfs_shift_uids,vfs_shift_gids,subvolid=5,subvol=/)

[tixxdz@fedora-kvm btrfs_root]$ sudo mount -t overlay overlay \
-o,lowerdir=/mnt/btrfs_root/rootfs/fedora-tree,upperdir=/mnt/btrfs_root/container-uid-1000000/upperdir,workdir=/mnt/btrfs_root/container-uid-1000000/workdir \
/mnt/btrfs_root/container-uid-1000000/merged

[tixxdz@fedora-kvm btrfs_root]$ sudo mount -t overlay overlay \
-o,lowerdir=/mnt/btrfs_root/rootfs/fedora-tree,upperdir=/mnt/btrfs_root/container-uid-1000000/upperdir,workdir=/mnt/btrfs_root/container-uid-2000000/workdir \
/mnt/btrfs_root/container-uid-2000000/merged

[tixxdz@fedora-kvm btrfs_root]$ sudo chown -R 1000000.1000000 /mnt/btrfs_root/container-uid-1000000/workdir/work/
[tixxdz@fedora-kvm btrfs_root]$ sudo chown -R 2000000.2000000 /mnt/btrfs_root/container-uid-2000000/workdir/work/

[ Term 1 ]
[tixxdz@fedora-kvm container-uid-1000000]$ sudo ~/bin/mountns-uidshift -u 1000000
bash: /root/.bashrc: Permission denied
bash-4.3# cat /proc/self/uid_map
0 1000000 65536
bash-4.3# touch container-uid-1000000/merged/rootfile
bash-4.3# stat -c '%u:%g' container-uid-1000000/merged/rootfile
0:0

[ Term 2 ]
[tixxdz@fedora-kvm btrfs_root]$ sudo ~/bin/mountns-uidshift -u 2000000
[sudo] password for tixxdz:
bash: /root/.bashrc: Permission denied
bash-4.3# cat /proc/self/uid_map
0 2000000 65536
bash-4.3# stat -c '%u:%g' container-uid-2000000/merged/rootfile
0:0

[ Term 3 ] (outside of all namespaces)
[tixxdz@fedora-kvm btrfs_root]$ stat -c '%u:%g' container-uid-1000000/upperdir/rootfile
0:0

This means that root in a user namespace or inside containers is able to
write inodes with uid/gid == 0 to disk.
This may sound strange and dangerous, and of course care must be taken. To
that end we have added the following:

1) Filesystems must explicitly support "vfs_shift_uids" and
"vfs_shift_gids" when mounted; we don't require mounting inside user
namespaces.

2) Containers or mounts can have their parent directory set to 0700, and
can even clean up the mount namespace before mounting, set the appropriate
propagation flags, and so on...

3) To be able to set the CLONE_MNTNS_SHIFT_UIDGID flag on the new mount
namespace, either the caller has to be real root in init_user_ns, or the
parent of the new mount namespace must already have that flag set. This
allows nesting, which I discussed briefly with Serge Hallyn, and he
suggested that this should be supported. Preventing nesting is doomed to
fail. This way we have security and nesting at the same time. Of course, if
you clear that flag you will not be able to set it again unless you are
capable in init_user_ns.

4) If the mount namespace has the CLONE_MNTNS_SHIFT_UIDGID flag set but the
filesystem was mounted without "vfs_shift_uids" and "vfs_shift_gids", or
does not support these options, then no shifting is performed. Both
conditions have to be met at each access, otherwise we fall back to the
current behaviour.

5) Only the creator of the mount namespace, or one with similar privileges,
is able to change the mapping rules of the user namespace of that mount
namespace. This ensures that only a more privileged user is able to change
the mapping, and at the same time it gives some flexibility since the rules
can be changed. We never persist the virtual UIDs/GIDs to disk; only the
view in init_user_ns is ever stored on disk.

To complete this solution the current blocker is: since we need a way to
control mount namespaces we need a new flag, however all flags of the
current clone() syscall are consumed, yes 32 bits, no luck!
In this RFC I didn't include a new clone4() syscall [2], which was already
requested in the past and for which patches already exist. This way this
RFC stays minimal. The flag we use here is just for demonstration; please
see patch 0001 and the program mountns-uidshift.c [3] for that. Future
versions will include the new clone4() syscall.

2) TEST:
========

Apply on top of Linux 4.6-rc6 HEAD 04974df8049fc4240d2275, and use this
program mountns-uidshift.c to test the shifted mount namespaces.
https://raw.githubusercontent.com/OpenDZ/research/master/kernel/mountns-uidshift.c

With current mapping rules
init_user_ns: [1000000:1065536] => new_user_ns: [0:65536]

# cat /proc/self/uid_map
0 1000000 65536
# cat /proc/self/gid_map
0 1000000 65536

2.1) ext4:
==========

Setup:
/ on ext4 without vfs_shift_uids, vfs_shift_gids
/mnt/ext4_root on ext4 with vfs_shift_uids, vfs_shift_gids
/mnt/ext4_root/rootfs/fedora-tree (Another fedora rootfs)
/mnt/ext4_root/container-uid-1000000 (container files with uid 1000000)
/mnt/ext4_root/container-uid-1000000/mountpoint (bind mount of fedora-tree)

$ sudo mount -t ext4 -ovfs_shift_uids,vfs_shift_gids \
/dev/fedora/ext4_root /mnt/ext4_root/

$ mount | grep ext4 -
/dev/mapper/fedora-root on / type ext4 (rw,relatime,seclabel,data=ordered)
/dev/sda1 on /boot type ext4 (rw,relatime,seclabel,data=ordered)
/dev/mapper/fedora-ext4_root on /mnt/ext4_root type ext4 (rw,relatime,seclabel,data=ordered,vfs_shift_uids,vfs_shift_gids)

$ sudo mkdir /mnt/ext4_root/rootfs/
$ sudo yum -y --releasever=23 --installroot=/mnt/ext4_root/rootfs/fedora-tree \
--disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim

$ sudo mkdir /mnt/ext4_root/container-uid-1000000/
$ sudo mkdir /mnt/ext4_root/container-uid-1000000/mountpoint
$ sudo chown -R 1000000.1000000 /mnt/ext4_root/container-uid-1000000/

$ sudo mount --bind -ovfs_shift_uids,vfs_shift_gids \
/mnt/ext4_root/rootfs/fedora-tree/ mountpoint/

$ mount | grep vfs_shift_uids -
/dev/mapper/fedora-ext4_root on /mnt/ext4_root type ext4 (rw,relatime,seclabel,data=ordered,vfs_shift_uids,vfs_shift_gids)
/dev/mapper/fedora-ext4_root on /mnt/ext4_root/container-uid-1000000/mountpoint type ext4 (rw,relatime,seclabel,data=ordered,vfs_shift_uids,vfs_shift_gids)

$ sudo ~/bin/mountns-uidshift -u 1000000
...
bash-4.3# id
uid=0(root) gid=0(root) groups=0(root) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
bash-4.3# cat /proc/self/uid_map
0 1000000 65536
bash-4.3# mount | grep shift -
/dev/mapper/fedora-ext4_root on /mnt/ext4_root type ext4 (rw,relatime,seclabel,data=ordered,vfs_shift_uids,vfs_shift_gids)
/dev/mapper/fedora-ext4_root on /mnt/ext4_root/container-uid-1000000/mountpoint type ext4 (rw,relatime,seclabel,data=ordered,vfs_shift_uids,vfs_shift_gids)

bash-4.3# stat -c '%u:%g' /etc/motd
65534:65534
bash-4.3# stat -c '%u:%g' /mnt/ext4_root/rootfs/fedora-tree/etc/motd
0:0
bash-4.3# stat -c '%u:%g' mountpoint/etc/motd
0:0

bash-4.3# stat -c '%u:%g' /etc/machine-id
65534:65534
bash-4.3# echo "blabla" > /etc/machine-id
bash: /etc/machine-id: Permission denied

bash-4.3# stat -c '%u:%g' mountpoint/etc/machine-id
0:0
bash-4.3# sha1sum mountpoint/etc/machine-id
edb24591988f0f003cd397704f49e92208b3015f  mountpoint/etc/machine-id
bash-4.3# m=$(cat /dev/urandom | tr -cd 'a-f0-9' | head -c 32); echo $m | sha1sum; echo $m > mountpoint/etc/machine-id
f256a796b1f2ed09c4107f1f5aff2568fb2d79cc  -
bash-4.3# sha1sum mountpoint/etc/machine-id
f256a796b1f2ed09c4107f1f5aff2568fb2d79cc  mountpoint/etc/machine-id
bash-4.3# stat -c '%u:%g' mountpoint/etc/machine-id
0:0

[outside of namespaces]$ stat -c '%u:%g' /mnt/ext4_root/container-uid-1000000/mountpoint/etc/machine-id
0:0

Test with unprivileged user inside the new mount and user namespaces:
---------------------------------------------------------------------

Test with uid tixxdz == 1000, the user exists on both:
(1) /
(2) /mnt/ext4_root/rootfs/fedora-tree which is bind mounted into
    /mnt/ext4_root/container-uid-1000000/mountpoint

-bash-4.3$ touch /home/tixxdz/newfile
touch: cannot touch /home/tixxdz/newfile: Permission denied
-bash-4.3$ stat -c '%u:%g' /home/tixxdz/
65534:65534

-bash-4.3$ stat -c '%u:%g' mountpoint/home/tixxdz/
1000:1000
-bash-4.3$ touch mountpoint/home/tixxdz/newfile
-bash-4.3$ stat -c '%u:%g' mountpoint/home/tixxdz/newfile
1000:1000

[outside of namespaces]$ stat -c '%u:%g' /mnt/ext4_root/container-uid-1000000/mountpoint/home/tixxdz/newfile
1000:1000

2.2) btrfs:
===========

Same steps as ext4.

2.3) overlayfs:
===============

2.3.1) Native support using VFS:

Overlayfs is natively supported if lowerdir, upperdir and workdir are all
on a mount that supports the vfs_shift_uids and vfs_shift_gids flags and we
are in a mount namespace that also supports them.

$ mount | grep btrfs
/dev/mapper/fedora-btrfs_root on /mnt/btrfs_root type btrfs (rw,relatime,seclabel,space_cache,vfs_shift_uids,vfs_shift_gids,subvolid=5,subvol=/)

$ cd /mnt/btrfs_root/
$ sudo mkdir -p container-uid-2000000/{upperdir,workdir,merged}
$ sudo chown -R 2000000.2000000 container-uid-2000000/
$ cd container-uid-2000000/
$ sudo mount -t overlay overlay -o,lowerdir=/mnt/btrfs_root/rootfs/fedora-tree,upperdir=upperdir,workdir=workdir merged
$ sudo chown -R 2000000.2000000 workdir/work/

$ sudo ~/bin/mountns-uidshift -u 2000000
...
bash-4.3# stat -c '%u:%g' merged/etc/passwd
0:0
bash-4.3# touch merged/overlayfs-file
bash-4.3# stat -c '%u:%g' merged/overlayfs-file
0:0

[outside of namespaces]# stat -c '%u:%g' /mnt/btrfs_root/container-uid-2000000/merged/overlayfs-file
0:0
[outside of namespaces]# stat -c '%u:%g' /mnt/btrfs_root/container-uid-2000000/upperdir/overlayfs-file
0:0

2.3.2) Complex support or union filesystems:

If the overlayfs lowerdir and upperdir are not on a filesystem that
natively supports vfs_shift_uids and vfs_shift_gids, then to support VFS
UID/GID shifts we must adapt the helper functions that were introduced in
this series to also take a super_block struct and test whether the
appropriate flags were set on overlayfs instead of on the other filesystem
which the inode belongs to. The on-disk <=> virtual translation should then
happen inside overlayfs. I think this will always be the case for union
mounts which fetch an inode from another mount.

I think that solution (2.3.2) can also be implemented. I had some ugly
patches to implement this on top of overlayfs, but I am not sure; better
to see what others think about VFS UID/GID shifts first.

IMO solution (2.3.1), if done correctly, is the way to go. In the end all
this relates to the virtual view of UID/GID inside the kernel and how
resources are translated to them; it's not related to overlayfs.

3) ROADMAP:
===========

* Confirm current design, and make sure that the mapping is done correctly.

* Add clone4() syscall [2]

* Investigate whether the current setns() checks to enter new mount
  namespaces are sufficient?

* Add POSIX ACL support?

* Check if all filesystem operations are correctly supported and recheck
  permission checks on access.

* Do filesystems provide some operations to control disk or host
  resources? In other words, are there some inodes on filesystems that
  allow access to host resources? If so, then maybe these inodes should
  either be marked safe only in init_user_ns or get the appropriate
  capable() check in init_user_ns if missing.
  Needs investigation.

* Add XFS support.

References:
===========

[1] https://www.redhat.com/archives/dm-devel/2016-April/msg00368.html
[2] https://lkml.org/lkml/2015/3/15/10
[3] https://raw.githubusercontent.com/OpenDZ/research/master/kernel/mountns-uidshift.c

Thanks!

Patches:
[RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
[RFC v2 PATCH 1/8] VFS: add CLONE_MNTNS_SHIFT_UIDGID flag to allow mounts to shift their UIDs/GIDs
[RFC v2 PATCH 2/8] VFS:uidshift: add flags and helpers to shift UIDs and GIDs to virtual view
[RFC v2 PATCH 3/8] fs: Treat foreign mounts as nosuid
[RFC v2 PATCH 4/8] VFS:userns: shift UID/GID to virtual view during permission access
[RFC v2 PATCH 5/8] VFS:userns: add helpers to shift UIDs and GIDs into on-disk view
[RFC v2 PATCH 6/8] VFS:userns: shift UID/GID to on-disk view before any write to disk
[RFC v2 PATCH 7/8] ext4: add support for vfs_shift_uids and vfs_shift_gids mount options
[RFC v2 PATCH 8/8] btrfs: add support for vfs_shift_uids and vfs_shift_gids mount options

Diffstat for this RFC
 fs/attr.c                      |  44 +++++++++++++++++++++++--------
 fs/btrfs/super.c               |  15 ++++++++++-
 fs/exec.c                      |   2 +-
 fs/ext4/super.c                |  14 ++++++++++
 fs/inode.c                     |   9 ++++---
 fs/mount.h                     |   1 +
 fs/namei.c                     |   6 +++--
 fs/namespace.c                 | 190 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/stat.c                      |   4 +--
 include/linux/fs.h             |  14 ++++++++++
 include/linux/mount.h          |   1 +
 include/linux/user_namespace.h |   8 ++++++
 include/uapi/linux/sched.h     |   1 +
 kernel/capability.c            |  14 ++++++++--
 kernel/fork.c                  |   4 +++
 kernel/user_namespace.c        |  13 ++++++++++
 security/commoncap.c           |   2 +-
 security/selinux/hooks.c       |   2 +-
 18 files changed, 319 insertions(+), 25 deletions(-)