On Thu, Oct 29, 2020 at 5:37 PM Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
>
> Aleksa Sarai <cyphar@xxxxxxxxxx> writes:
>
> > On 2020-10-29, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
> >> Christian Brauner <christian.brauner@xxxxxxxxxx> writes:
> >>
> >> > Hey everyone,
> >> >
> >> > I vanished for a little while to focus on this work here so sorry for
> >> > not being available by mail for a while.
> >> >
> >> > Since quite a long time we have issues with sharing mounts between
> >> > multiple unprivileged containers with different id mappings, sharing a
> >> > rootfs between multiple containers with different id mappings, and also
> >> > sharing regular directories and filesystems between users with different
> >> > uids and gids. The latter use-cases have become even more important with
> >> > the availability and adoption of systemd-homed (cf. [1]) to implement
> >> > portable home directories.
> >>
> >> Can you walk us through the motivating use case?
> >>
> >> As of this year's LPC I had the distinct impression that the primary use
> >> case for such a feature was due to the RLIMIT_NPROC problem where two
> >> containers with the same users still wanted different uid mappings to
> >> the disk because the users were conflicting with each other because of
> >> the per user rlimits.
> >>
> >> Fixing rlimits is straight forward to implement, and easier to manage
> >> for implementations and administrators.
> >
> > This is separate to the question of "isolated user namespaces" and
> > managing different mappings between containers. This patchset is solving
> > the same problem that shiftfs solved -- sharing a single directory tree
> > between containers that have different ID mappings. rlimits (nor any of
> > the other proposals we discussed at LPC) will help with this problem.
>
> First and foremost: A uid shift on write to a filesystem is a security
> bug waiting to happen.
> This is especially true in the context of facilities
> like iouring, that play very aggressive games with how process context
> makes it to system calls.
>
> The only reason containers were not immediately exploitable when iouring
> was introduced is because the mechanisms are built so that even if
> something escapes containment the security properties still apply.
> Changes to the uid when writing to the filesystem do not have that
> property. The tiniest slip in containment will be a security issue.
>
> This is not even the least bit theoretical. I have seen reports of how
> shiftfs+overlayfs created a situation where anyone could read
> /etc/shadow.
>
> If you are going to write using the same uid to disk from different
> containers the question becomes why can't those containers configure
> those users to use the same kuid?
>
> What fixing rlimits does is it fixes one of the reasons that different
> containers could not share the same kuid for users that want to write to
> disk with the same uid.
>
> I humbly suggest that it will be more secure, and easier to maintain for
> both developers and users if we fix the reasons people want different
> containers to have the same user running with different kuids.
>
> If not what are the reasons we fundamentally need the same on-disk user
> using multiple kuids in the kernel?

I would like to use this patch set in the context of Kubernetes. I
described my two possible setups in
https://www.spinics.net/lists/linux-containers/msg36537.html:

1. Each Kubernetes pod has its own userns but with the same user id mapping
2. Each Kubernetes pod has its own userns with non-overlapping user id
mapping (providing additional isolation between pods)

But even in the setup where all pods run with the same id mappings,
this patch set is still useful to me for 2 reasons:

1. To avoid the expensive recursive chown of the rootfs.
We cannot necessarily extract the tarball directly with the right uids
because we might use the same container image for privileged containers
(with the host userns) and unprivileged containers (with a new userns),
so we have at least 2 “mappings” (taking more time and resulting in more
storage space). Although the “metacopy” mount option in overlayfs helps
to make the recursive chown faster, it can still take time with large
container images with lots of files. I’d like to use this patch set to
set up the rootfs in constant time.

2. To manage large external volumes (NFS or other filesystems). Even if
admins can decide to use the same kuid on all the nodes of the
Kubernetes cluster, this is impractical for migration. People can have
existing Kubernetes clusters (currently without using user namespaces)
and large NFS volumes. If they want to switch to a new version of
Kubernetes with the user namespace feature enabled, they would need to
recursively chown all the files on the NFS shares. This could take time
on large filesystems and, realistically, we want to support rolling
updates where some nodes use the previous version without user
namespaces and new nodes are progressively migrated to the new userns
with the new id mapping. If both sets of nodes use the same NFS share,
that can’t work.

Alban
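
To illustrate why the recursive chown mentioned in both reasons above is
expensive, here is a minimal sketch of what such a shift amounts to. It
is not taken from any container runtime: `iter_tree`, `shift_tree`, the
injectable `chown` parameter and the single `offset` applied to both uid
and gid are all assumptions made for the example. The point is only that
the work is one lstat plus one chown per inode in the tree.

```python
import os
from typing import Callable, Iterator


def iter_tree(root: str) -> Iterator[str]:
    """Yield root and every entry below it, without following symlinks."""
    yield root
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            yield os.path.join(dirpath, name)


def shift_tree(root: str, offset: int,
               chown: Callable[[str, int, int], None] = os.lchown) -> int:
    """Add `offset` (hypothetical, same for uid and gid) to every inode under root.

    Returns the number of chown calls made, which grows linearly with the
    number of inodes in the tree.
    """
    seen = set()
    count = 0
    for path in iter_tree(root):
        st = os.lstat(path)
        key = (st.st_dev, st.st_ino)
        if key in seen:
            # Another hard link to an inode that was already shifted.
            continue
        seen.add(key)
        chown(path, st.st_uid + offset, st.st_gid + offset)
        count += 1
    return count
```

Run as root with something like `shift_tree("/path/to/rootfs", 100000)`
(both values are placeholders), this visits every file once per mapping,
which is the cost I would like this patch set to remove.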