Re: [PATCH 00/34] fs: idmapped mounts

Stéphane Graber <stgraber@xxxxxxxxxx> · Thu, 29 Oct 2020 14:04:31 -0400

On Thu, Oct 29, 2020 at 12:45 PM Eric W. Biederman
<ebiederm@xxxxxxxxxxxx> wrote:
>
> Tycho Andersen <tycho@tycho.pizza> writes:
>
> > Hi Eric,
> >
> > On Thu, Oct 29, 2020 at 10:47:49AM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <christian.brauner@xxxxxxxxxx> writes:
> >>
> >> > Hey everyone,
> >> >
> >> > I vanished for a little while to focus on this work here so sorry for
> >> > not being available by mail for a while.
> >> >
> >> > For quite a long time we have had issues with sharing mounts between
> >> > multiple unprivileged containers with different id mappings, sharing a
> >> > rootfs between multiple containers with different id mappings, and also
> >> > sharing regular directories and filesystems between users with different
> >> > uids and gids. The latter use-cases have become even more important with
> >> > the availability and adoption of systemd-homed (cf. [1]) to implement
> >> > portable home directories.
> >>
> >> Can you walk us through the motivating use case?
> >>
> >> As of this year's LPC I had the distinct impression that the primary use
> >> case for such a feature was due to the RLIMIT_NPROC problem where two
> >> containers with the same users still wanted different uid mappings to
> >> the disk because the users were conflicting with each other because of
> >> the per user rlimits.
> >>
> >> Fixing rlimits is straightforward to implement, and easier to manage
> >> for implementations and administrators.
> >
> > Our use case is to have the same directory exposed to several
> > different containers which each have disjoint ID mappings.
>
> Why do you have disjoint ID mappings for the users that are writing
> to disk with the same ID?
>
> >> Reading up on systemd-homed it appears to be a way to have encrypted
> >> home directories.  Those home directories can either be encrypted at the
> >> fs or at the block level.  Those home directories appear to have the
> >> goal of being luggable between systems.  If the systems in question
> >> don't have common administration of uids and gids after lugging your
> >> encrypted home directory to another system chowning the files is
> >> required.
> >>
> >> Is that the use case you are looking at removing the need for
> >> systemd-homed to avoid chowning after lugging encrypted home directories
> >> from one system to another?  Why would it be desirable to avoid the
> >> chown?
> >
> > Not just systemd-homed, but LXD has to do this,
>
> I asked why the same disk users are assigned different kuids and the
> only reason I have heard that LXD does this is the RLIMIT_NPROC problem.
>
> Perhaps there is another reason.
>
> In part this is why I am eager to hear people's use cases, and why I was
> trying very hard to make certain we get the requirements.
>
> I want the real requirements though and some thought, not just we did
> this and it hurts.  Changing the uids on write is a very hard problem,
> and not just in implementing it but also in maintaining and
> understanding what is going on.

The most common cases where shiftfs is used or where folks would like
to use it today are (in order of importance):
 - Fast container creation (by not having to uid/gid shift all files
   in the downloaded image)
 - Sharing data between the host system and a container (some paths
   under /home being the most common)
 - Sharing data between unprivileged containers with disjoint maps
 - Sharing data between multiple containers, some privileged, some
   unprivileged

Fixing the rlimit issue only takes care of one of those (the 3rd
item); it does not solve any of the other cases.

The first item alone can be quite significant. Creating and starting a
regular Debian container on my system takes around 500ms when shiftfs
is used (btrfs/lvm/zfs copy-on-write clone of the image, set up
shiftfs, start container) compared to 2-3s without it (same clone,
followed by a rewrite of every uid/gid present on the fs, including
ACLs and capabilities, then start container). And that's on a fast
system with an NVMe SSD and a small rootfs. We have had reports from a
few users running large containers on slow spinning rust where
shifting can take several minutes.
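
To make the cost concrete, here is a rough sketch of what such a
shifting pass has to do. This is not LXD's actual code: the function
names, the 65536-wide range and the base offset are assumptions for
illustration, and it only covers owner uid/gid (the real pass also has
to rewrite ACLs and file capabilities). The point is that every inode
in the rootfs gets visited and rewritten.

```python
import os
from typing import Iterator, Tuple


def plan_shift(root: str, base: int, size: int = 65536) -> Iterator[Tuple[str, int, int]]:
    """Yield (path, new_uid, new_gid) for root and everything below it.

    IDs in [0, size) are moved to [base, base + size); IDs outside that
    range are left unchanged. Nothing is modified on disk.
    (base and size are hypothetical example parameters.)
    """
    def shift(i: int) -> int:
        return base + i if 0 <= i < size else i

    paths = [root]
    # os.walk does not follow symlinks by default, so each entry is
    # visited exactly once and symlinks are reported as themselves.
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)

    for path in paths:
        st = os.lstat(path)
        yield path, shift(st.st_uid), shift(st.st_gid)


def apply_shift(root: str, base: int, size: int = 65536) -> None:
    """Apply the plan with lchown(); needs privilege (CAP_CHOWN)."""
    for path, uid, gid in plan_shift(root, base, size):
        os.lchown(path, uid, gid)
```

The work is one lstat() plus one lchown() per file, so it scales with
the number of inodes in the image rather than with anything the
container actually touches.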

The second item can technically be worked around without shifted
bind-mounts by doing userns map hole punching, that is, mapping the
user's uid/gid from the host straight into the container. The downside
is that another shifting pass becomes needed for any file outside of
the bind-mounted path (or it would become owned by -1/-1). It's also
very much not dynamic: the container has to be stopped, its config
updated by the user, /etc/subuid and /etc/subgid updated, and the
container started back up. If you need another user/group exposed, you
start all over again...
This is far more complex, slow and disruptive than the shifted
approach where we just need to do:
   lxc config device add MY-CONTAINER home disk source=/home path=/home shift=true
to inject a new mount of /home from the host into the container with a
shifting layer in place. There is no need to reconfigure
subuid/subgid, no need to re-create the userns to update the mapping,
and no need to go through the container's rootfs for any file which
may now need remapping because of the map change.
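
For anyone not familiar with the hole punching mentioned above, here
is a small sketch of what the resulting map looks like. The helper name
and the example numbers are made up for illustration; the entries use
the (inside, outside, count) layout of /proc/PID/uid_map.

```python
from typing import List, Tuple


def punch_hole(base: int, size: int, uid: int) -> List[Tuple[int, int, int]]:
    """Build uid_map entries (inside, outside, count) for a container.

    Container IDs [0, size) map to host IDs [base, base + size), except
    for `uid`, which maps straight through to the same host uid.
    (Hypothetical helper, not an LXD function.)
    """
    if not 0 <= uid < size:
        raise ValueError("uid must fall inside the container range")

    entries = []
    if uid > 0:
        # Everything below the hole keeps the regular shifted mapping.
        entries.append((0, base, uid))
    # The hole itself: container uid == host uid.
    entries.append((uid, uid, 1))
    if uid + 1 < size:
        # Everything above the hole continues the shifted mapping.
        entries.append((uid + 1, base + uid + 1, size - uid - 1))
    return entries
```

With base=1000000, size=65536 and uid=1000 this gives three entries:
(0, 1000000, 1000), (1000, 1000, 1) and (1001, 1001001, 64535). Files
in the rootfs that were owned by container uid 1000 used to sit at
host uid 1001000, which is no longer in the map, hence the extra
shifting pass.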

Stéphane

> Eric
> _______________________________________________
> Containers mailing list
> Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
> https://lists.linuxfoundation.org/mailman/listinfo/containers