On Thu, Jan 21, 2021 at 02:19:19PM +0100, Christian Brauner wrote:
> Hey everyone,
>
> The only major change is the updated version of hch's patch to port xfs
> to support idmapped mounts. Thanks again to Christoph for doing that
> work.
> (Otherwise Acked-bys and Reviewed-bys were added and the tree reordered
> to decouple filesystem specific conversion from the vfs work so they
> can proceed independently.
> For a full list of major changes between versions see the end of this
> cover letter. Please also note the large xfstests testsuite in patch 42
> that has been kept as part of this series. It verifies correct vfs
> behavior with and without idmapped mounts including covering newer vfs
> features such as io_uring.
> I currently still plan to target the v5.12 merge window.)
>
> With this patchset we make it possible to attach idmappings to mounts,
> i.e. simply put different bind mounts can expose the same file or
> directory with different ownership.
> Shifting of ownership on a per-mount basis handles a wide range of
> long standing use-cases. Here are just a few:
> - Shifting of a subset of ownership-less filesystems (vfat) for use by
>   multiple users, effectively allowing for DAC on such devices
>   (systemd, Android, ...)
> - Allow remapping uid/gid on external filesystems or paths (USB sticks,
>   network filesystem, ...) to match the local system's user and groups.
>   (David Howells intends to port AFS as a first candidate.)
> - Shifting of a container rootfs or base image without having to mangle
>   every file (runc, Docker, containerd, k8s, LXD, systemd ...)
> - Sharing of data between host or privileged containers with
>   unprivileged containers (runC, Docker, containerd, k8s, LXD, ...)
> - Data sharing between multiple user namespaces with incompatible maps
>   (LXD, k8s, ...)
>
> There has been significant interest in this patchset as evidenced by
> users commenting on previous versions of this patchset.
> They include
> containerd, ChromeOS, systemd, LXD and a range of others. There is
> already a patchset up for containerd, the default Kubernetes container
> runtime https://github.com/containerd/containerd/pull/4734
> to make use of this. systemd intends to use it in their systemd-homed
> implementation for portable home directories. ChromeOS wants to make use
> of it to share data between the host and the Linux containers they run
> on Chrome- and Pixelbooks. There are also a few talks by people who
> are going to make use of this. The most recent one was a CNCF webinar
> https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
> and an upcoming talk during FOSDEM.
> (Fwiw, for fun and since I wanted to do this for a long time I've ported
> my home directory to be completely portable with a simple service file
> that now mounts my home directory on an ext4 formatted usb stick with
> an id mapping mapping all files to the random uid I'm assigned at
> login.)
>
> Making it possible to share directories and mounts between users with
> different uids and gids is itself quite an important use-case in
> distributed systems environments. It's of course especially useful in
> general for portable usb sticks, sharing data between multiple users in,
> and sharing home directories between multiple users. The last example is
> now elegantly expressed in systemd's homed concept for portable home
> directories. As mentioned above, idmapped mounts also allow data from
> the host to be shared with unprivileged containers, between privileged
> and unprivileged containers simultaneously and in addition also between
> unprivileged containers with different idmappings whenever they are used
> to isolate one container completely from another container.
>
> We have implemented and proposed multiple solutions to this before.
> This included the introduction of fsid mappings, a tiny filesystem I've
> authored with Seth Forshee that is currently carried in Ubuntu that has
> shown to be the wrong approach, and the conceptual hack of calling
> override creds directly in the vfs. In addition to some of these
> solutions being hacky, none of these solutions have covered all of the
> above use-cases.
>
> Idmappings become a property of struct vfsmount instead of tying it to a
> process being inside of a user namespace which has been the case for all
> other proposed approaches. It also allows passing the user namespace
> down into the filesystems, which is a clean way instead of violating
> calling conventions by strapping the user namespace information that is
> a property of the mount to the caller's credentials or similar hacks.
> Each mount can have a separate idmapping and idmapped mounts can even be
> created in the initial user namespace unblocking a range of use-cases.
>
> To this end the vfsmount struct gains a new struct user_namespace
> member. The idmapping of the user namespace becomes the idmapping of the
> mount. A caller that is privileged with respect to the user namespace of
> the superblock of the underlying filesystem can create an idmapped
> mount. In the future, we can enable unprivileged use-cases by checking
> whether the caller is privileged wrt the user namespace that an
> already idmapped mount has been marked with, allowing them to change the
> idmapping. For now, keep things simple until the need arises.
> Note that with syscall interception it is already possible to intercept
> idmapped mount requests from unprivileged containers and handle them in
> a sufficiently privileged container manager. Support for this is already
> available in LXD and will be available in runC where syscall
> interception is currently in the process of becoming part of the runtime
> spec: https://github.com/opencontainers/runtime-spec/pull/1074.
>
> The user namespace the mount will be marked with can be specified by
> passing a file descriptor referring to the user namespace as an argument
> to the new mount_setattr() syscall together with the new
> MOUNT_ATTR_IDMAP flag. By default vfsmounts are marked with the initial
> user namespace and no behavioral or performance changes are observed.
> All mapping operations are nops for the initial user namespace. When a
> file/inode is accessed through an idmapped mount the i_uid and i_gid of
> the inode will be remapped according to the user namespace the mount has
> been marked with.
>
> In order to support idmapped mounts, filesystems need to be changed and
> mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The initial
> version contains fat, ext4, and xfs including a list of examples.
> Patches for other filesystems are actively being worked on and will be
> sent out separately. We are here to see this through and there are
> multiple people involved in converting filesystems. So filesystem
> developers are not left alone with this and are provided with a large
> testsuite to verify that their port is correct.
>
> There is a simple tool available at
> https://github.com/brauner/mount-idmapped that allows creating idmapped
> mounts so people can play with this patch series. Here are a few
> illustrations:
>
> 1. Create a simple idmapped mount of another user's home directory
>
> u1001@f2-vm:/$ sudo ./mount-idmapped --map-mount b:1000:1001:1 /home/ubuntu/ /mnt
> u1001@f2-vm:/$ ls -al /home/ubuntu/
> total 28
> drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
> drwxr-xr-x 4 root   root   4096 Oct 28 04:00 ..
> -rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
> -rw-r--r-- 1 ubuntu ubuntu  220 Feb 25  2020 .bash_logout
> -rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25  2020 .bashrc
> -rw-r--r-- 1 ubuntu ubuntu  807 Feb 25  2020 .profile
> -rw-r--r-- 1 ubuntu ubuntu    0 Oct 16 16:11 .sudo_as_admin_successful
> -rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo

So I assume this falls under the buyer beware warning, but it's probably
important to warn people loudly that, at this point, the user with
uid 1001 can chmod u+s any binary under /mnt and then run it from
/home/ubuntu with euid=1000.

In other words, while this has excellent uses, if you *can* use shared
group membership, you should :)

Very cool though.
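
To make the concern concrete, here is a toy model in Python. It is only
a sketch under stated assumptions: an idmapped mount is reduced to a
lookup table from on-disk uid to the uid seen through that mount, and
Inode, Mount, chmod_setuid and exec_euid are names made up for this
illustration, not kernel interfaces. The real checks in the vfs are
more involved (capabilities, nosuid, gids and so on).

```python
from dataclasses import dataclass
from typing import Dict, Optional

S_ISUID = 0o4000  # setuid bit


@dataclass
class Inode:
    uid: int   # on-disk owner, shared by every mount of the file
    mode: int


@dataclass
class Mount:
    # Hypothetical stand-in for a mount's idmapping.
    # None means a plain mount; a dict maps on-disk uid -> uid in the mount.
    idmap: Optional[Dict[int, int]] = None

    def to_mount(self, disk_uid: int) -> Optional[int]:
        if self.idmap is None:
            return disk_uid
        return self.idmap.get(disk_uid)  # None: no mapping for this uid


def chmod_setuid(caller_uid: int, mnt: Mount, inode: Inode) -> None:
    """Set the setuid bit if the caller owns the file as seen via mnt."""
    if mnt.to_mount(inode.uid) != caller_uid:
        raise PermissionError("caller does not own the file on this mount")
    inode.mode |= S_ISUID  # the bit lands on the shared inode


def exec_euid(caller_uid: int, mnt: Mount, inode: Inode) -> int:
    """Effective uid after exec of the file through mnt (toy model)."""
    owner = mnt.to_mount(inode.uid)
    if inode.mode & S_ISUID and owner is not None:
        return owner
    return caller_uid


binary = Inode(uid=1000, mode=0o755)   # some binary owned by ubuntu
home = Mount()                         # /home/ubuntu, not idmapped
mnt = Mount(idmap={1000: 1001})        # /mnt, as with b:1000:1001:1

chmod_setuid(1001, mnt, binary)        # allowed: 1001 is the owner via /mnt
print(exec_euid(1001, home, binary))   # prints 1000
```

In this model the same chmod through /home/ubuntu fails for uid 1001,
which is why the idmapped mount is the piece that opens the door.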