And re-sending, this time hopefully actually in plain text mode.
Sorry about that, my e-mail client isn't behaving today...

Stéphane

On Mon, Feb 17, 2020 at 4:57 PM Stéphane Graber <stgraber@xxxxxxxxxx> wrote:
>
> On Mon, Feb 17, 2020 at 4:12 PM James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>
>> On Fri, 2020-02-14 at 19:35 +0100, Christian Brauner wrote:
>> [...]
>> > With this patch series we simply introduce the ability to create fsid
>> > mappings that are different from the id mappings of a user namespace.
>> > The whole feature set is placed under a config option that defaults
>> > to false.
>> >
>> > In the usual case of running an unprivileged container we will have
>> > setup an id mapping, e.g. 0 100000 100000. The on-disk mapping will
>> > correspond to this id mapping, i.e. all files which we want to appear
>> > as 0:0 inside the user namespace will be chowned to 100000:100000 on
>> > the host. This works, because whenever the kernel needs to do a
>> > filesystem access it will lookup the corresponding uid and gid in the
>> > idmapping tables of the container.
>> > Now think about the case where we want to have an id mapping of 0
>> > 100000 100000 but an on-disk mapping of 0 300000 100000 which is
>> > needed to e.g. share a single on-disk mapping with multiple
>> > containers that all have different id mappings.
>> > This will be problematic. Whenever a filesystem access is requested,
>> > the kernel will now try to lookup a mapping for 300000 in the id
>> > mapping tables of the user namespace but since there is none the
>> > files will appear to be owned by the overflow id, i.e. usually
>> > 65534:65534 or nobody:nogroup.
>> >
>> > With fsid mappings we can solve this by writing an id mapping of 0
>> > 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
>> > access the kernel will now lookup the mapping for 300000 in the fsid
>> > mapping tables of the user namespace.
>> > And since such a mapping exists, the corresponding files will have
>> > correct ownership.
>>
>> How do we parametrise this new fsid shift for the unprivileged use
>> case? For newuidmap/newgidmap, it's easy because each user gets a
>> dedicated range and everything "just works (tm)". However, for the
>> fsid mapping, assuming some newfsuid/newfsgid tool to help, that tool
>> has to know not only your allocated uid/gid chunk, but also the offset
>> map of the image. The former is easy, but the latter is going to vary
>> by the actual image ... well unless we standardise some accepted shift
>> for images and it simply becomes a known static offset.
>
>
> For unprivileged runtimes, I would expect images to be unshifted and be
> unpacked from within a userns. So your unprivileged user would be allowed
> a uid/gid range through /etc/subuid and /etc/subgid and allowed to use
> them through newuidmap/newgidmap. In that namespace, you can then pull
> and unpack any images/layers you may want and the resulting fs tree will
> look correct from within that namespace.
>
> All that is possible today and is how, for example, unprivileged LXC works
> right now.
>
> What this patchset then allows is for containers to have differing
> uid/gid maps while still being based off the same image or layers.
> In this scenario, you would carve a subset of your main uid/gid map for
> each container you run and run them in a child user namespace while
> setting up a fsuid/fsgid map such that their filesystem accesses do not
> follow their uid/gid map. This then results in proper isolation for
> processes, networks, ... as everything runs as different kuid/kgid but
> the VFS view will be the same in all containers.
>
> Shared storage between those otherwise isolated containers would also
> work just fine by simply bind-mounting the same path into two or more
> containers.
>
>
> Now one additional thing that would be safe for a setuid wrapper to
> allow would be an arbitrary mapping of any of the uids/gids that the user
> owns to be used within the fsuid/fsgid map. One potential use for this
> would be to create any number of user namespaces, each with their own
> mapping for uid 0 while still having all VFS access be mapped to the
> user that spawned them (say uid=1000, gid=1000).
>
>
> Note that in our case, the intended use for this is from a privileged runtime
> where our images would be unshifted as would be the container storage
> and any shared storage for containers. The security model effectively relies
> on properly configured filesystem permissions and mount namespaces such
> that the content of those paths can never be seen by anyone but root outside
> of those containers (and therefore avoids all the issues around setuid/setgid/fscaps).
>
> We will then be able to allocate distinct, random ranges of 65536 uids/gids (or more)
> for each container without ever having to do any uid/gid shifting at the filesystem layer
> or run into issues when having to set up shared storage between containers or attaching
> external storage volumes to those containers.
>
>> James
>
>
> Stéphane
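
For anyone following the thread, the lookup described in the quoted cover
letter can be modelled in a few lines of Python. This is only a rough sketch
under stated assumptions, not the kernel implementation: the names Extent,
map_id_up, owner_seen_in_ns and OVERFLOW_ID are hypothetical, the second
container's id map (0 200000 100000) is made up for illustration, and the
rule "use the fsid map for filesystem access when one was written, otherwise
fall back to the id map" is my reading of the description above.

```python
from typing import NamedTuple, Optional, Sequence

# Usual overflow uid/gid (nobody/nogroup), as mentioned in the cover letter.
OVERFLOW_ID = 65534


class Extent(NamedTuple):
    """One mapping line, e.g. "0 100000 100000" (hypothetical model)."""
    first: int        # first id as seen inside the user namespace
    lower_first: int  # first id on the host (or on disk, for an fsid map)
    count: int        # number of consecutive ids covered


def map_id_up(extents: Sequence[Extent], lower_id: int) -> Optional[int]:
    """Translate a host/on-disk id into the namespace, or None if unmapped."""
    for e in extents:
        if e.lower_first <= lower_id < e.lower_first + e.count:
            return e.first + (lower_id - e.lower_first)
    return None


def owner_seen_in_ns(id_map: Sequence[Extent],
                     fsid_map: Sequence[Extent],
                     disk_id: int) -> int:
    """Owner id a process in the namespace sees for a file owned by disk_id.

    Assumption: the fsid map is consulted when present, else the id map.
    """
    table = fsid_map if fsid_map else id_map
    mapped = map_id_up(table, disk_id)
    return OVERFLOW_ID if mapped is None else mapped


# Mappings from the cover letter: id map 0 100000 100000,
# on-disk (fsid) map 0 300000 100000.
id_map_a = [Extent(0, 100000, 100000)]
id_map_b = [Extent(0, 200000, 100000)]  # made-up second container
fsid_map = [Extent(0, 300000, 100000)]

# Without an fsid map, on-disk id 300000 has no mapping: overflow id.
print(owner_seen_in_ns(id_map_a, [], 300000))
# With the fsid map, the same file appears as owned by id 0.
print(owner_seen_in_ns(id_map_a, fsid_map, 300000))
# A second container with a different id map sees the same ownership.
print(owner_seen_in_ns(id_map_b, fsid_map, 300000))
```

The last two calls show the point made above: the two containers run with
different id maps, but sharing one fsid map gives them the same VFS view.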