Aleksa Sarai <cyphar@xxxxxxxxxx> writes:

> On 2020-10-29, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
>> Christian Brauner <christian.brauner@xxxxxxxxxx> writes:
>>
>> > Hey everyone,
>> >
>> > I vanished for a little while to focus on this work here so sorry for
>> > not being available by mail for a while.
>> >
>> > Since quite a long time we have issues with sharing mounts between
>> > multiple unprivileged containers with different id mappings, sharing a
>> > rootfs between multiple containers with different id mappings, and also
>> > sharing regular directories and filesystems between users with different
>> > uids and gids. The latter use-cases have become even more important with
>> > the availability and adoption of systemd-homed (cf. [1]) to implement
>> > portable home directories.
>>
>> Can you walk us through the motivating use case?
>>
>> As of this year's LPC I had the distinct impression that the primary use
>> case for such a feature was due to the RLIMIT_NPROC problem where two
>> containers with the same users still wanted different uid mappings to
>> the disk because the users were conflicting with each other because of
>> the per user rlimits.
>>
>> Fixing rlimits is straight forward to implement, and easier to manage
>> for implementations and administrators.
>
> This is separate to the question of "isolated user namespaces" and
> managing different mappings between containers. This patchset is solving
> the same problem that shiftfs solved -- sharing a single directory tree
> between containers that have different ID mappings. rlimits (nor any of
> the other proposals we discussed at LPC) will help with this problem.

First and foremost: A uid shift on write to a filesystem is a security
bug waiting to happen. This is especially true in the context of
facilities like io_uring, which play very aggressive games with how
process context makes it to system calls.

The only reason containers were not immediately exploitable when
io_uring was introduced is that the mechanisms are built so that even
if something escapes containment the security properties still apply.
Changing the uid when writing to the filesystem does not have that
property. The tiniest slip in containment will be a security issue.

This is not even the least bit theoretical. I have seen reports of how
shiftfs+overlayfs created a situation where anyone could read
/etc/shadow.

If you are going to write using the same uid to disk from different
containers, the question becomes: why can't those containers configure
those users to use the same kuid?

Fixing rlimits removes one of the reasons that different containers
could not share the same kuid for users that want to write to disk
with the same uid.

I humbly suggest that it will be more secure, and easier to maintain
for both developers and users, if we fix the reasons people want
different containers to have the same user running with different
kuids.

If not, what are the reasons we fundamentally need the same on-disk
user using multiple kuids in the kernel?

Eric