Hello everyone,

## Preliminaries

This is the summary of the Hackroom session Stéphane and I led as a follow-up to our presentations in the Containers & Checkpoint/Restore micro-conference at Linux Plumbers 2020. Please make sure to see the Action Items section below as it outlines the next concrete steps that came up during the meeting and who seemed interested in tackling them.

The background for this summary is:

1. Stéphane's and my talk "Isolated Dynamic User Namespaces"
   People interested in the full session can watch it on YouTube:
   https://youtu.be/fSyr_IXM21Y?t=8856

2. The Hackroom session on Wednesday, 25.08.2020 at 17:00 UTC
   This session has been recorded as well. It is not yet on YouTube because Hackroom sessions weren't streamed. However, I plan on cutting that video and putting it up on YouTube as well just so there's no chance of miscommunication.

All people that attended session 1. were asked to send me an e-mail if they wanted to attend session 2. to hash out details.

The following people requested to attend session 2. and were informed either through the e-mail I sent out or IRC:

Aleksa Sarai
Alexander Mihalicyn
Andy Lutomirski
Christian Brauner
Eric W. Biederman
Geoffrey Thomas
Giuseppe Scrivano
Joseph Christopher Sible
Josh Triplett
Kees Cook
Mickaël Salaün
Mrunal Patel
Pavel Tikhomirov
Sargun Dhillon
Serge Hallyn
Stephane Graber
Vivek Goyal
Wat Lim

All of them should be Cced here. In case I forgot someone don't hesitate to forward this mail to them!

## Summary

During the Containers & Checkpoint/Restore micro-conference and in the hackroom session Stéphane Graber and I proposed a way to make using user namespaces simpler and more isolated. The following current problems were identified:

P1. Isolated id mappings can only be guaranteed to be locally isolated. A container runtime/daemon can only guarantee non-overlapping id mappings when no other users on the system create containers.

P2. Enforcing isolated id mappings in userspace is difficult.
It is always possible to create other processes with overlapping id mappings. Coordinating id mappings in userspace will always remain optional. Quite a few tools nowadays (including systemd) don't care about /etc/sub{g,u}id and actively advise against using them. This is made even more problematic since sub{g,u}id delegation is done per-user rather than per-container-runtime.

P3. The range of the id mapping of a container can't be predetermined. While POSIX mandates that a standard system should use a range of 65536 ids, reality is very different. Some programs allocate high ids for random processes or for network authentication. This means that in practice it is often necessary to assign a range of up to 10 million ids to a container. With a 32 bit id space this limits a system to fewer than 500 containers in total (2^32 / 10^7 is roughly 429).

P4. Isolated id mappings severely restrict the number of containers that can be run on a system. This ties back to the point about pre-determining the id range of a container and how large range allocations tend to be on real systems. That becomes even more relevant when nesting containers.

P5. Container runtimes cannot reuse overlayfs lower directories if each container uses isolated id mappings. This leads to either needless storage overhead (LXD -- though the LXD folks don’t really mind), completely ignoring the benefits of isolating containers from each other (Docker), or not using them at all (Kubernetes). (This is a more general issue but bears repeating since it is closely tied to most userns proposals.)

P6. Rlimits pose a problem for containers that share the same id mapping. Containers with overlapping id mappings can DOS each other by exhausting their rlimits. The reason for this lies with the current implementation of rlimits -- rlimits are currently tied to users and are not hierarchically limited like inotify limits are. This is a severe problem in unprivileged workloads.
Eric and others identified that this issue can be fixed independently of the isolated user namespace proposal.

In response to these and other issues, we made the following proposal, which had already been floated in a less clear form during informal discussions at Linux Plumbers 2019 in Lisbon:

## Proposal

Introduce an in-kernel concept of an isolated user namespace by switching the id types in the kernel from 32 to 64 bits. Userspace will only get to see the lower 32 bits as usual. The upper 32 bits are used for a unique, in-kernel user namespace token. The owner of such a namespace will either be the effective id of the creator of that namespace or, optionally, an owning id can be set (when created by a privileged user).

The following advantages were identified by various people during the session:

S1. An isolated user namespace has access to the full 32 bit id range. This makes it compatible with every Linux workload and makes it possible to support post-POSIX range users that allocate high-range ids (LDAP, systemd, etc). This solves P3 and P4.

S2. Kernel-enforced user namespace isolation. There is no need for different container runtimes to collaborate on id ranges, with immediate benefits for everyone. This solves P1 and P2.

S3. The need to split existing id ranges is completely removed. Nested containers become trivial.

S4. Simplify the usage of user namespaces significantly for newcomers. This should hopefully finally increase adoption in userspace, especially in the application container and Kubernetes space.

S5. The owning id concept of a user namespace makes monitoring and interacting with such containers much easier.

S6. The owning id can be used as a credential when interacting with an isolated user namespace from an ancestor user namespace.

The need and desire for these features seemed to be expressed by most participating parties.

### Issues

Two main issues were discussed:

1.
How are interactions across isolated user namespaces handled?

An isolated user namespace can interact with another isolated user namespace or an ancestor user namespace. Socket credentials are a good example: they can be seen and received outside of the container. In those cases the id of the isolated user namespace needs to be represented. The proposals to solve this problem were:

1.1. Use the owning id of the isolated user namespace. A parent user namespace would see the configured owning id of the isolated user namespace (mapped to that user namespace). A non-ancestor user namespace would see the overflow ids.

1.2. Always use the overflow id for isolated user namespaces. Any other user namespace would see the overflow id configured on the system.

Proposal 1.1 seemed preferred since it would allow an unprivileged user creating an isolated user namespace to kill/ptrace all processes in the isolated namespace they spawned. In contrast, proposal 1.2 would not allow for visible ownership of the container, which would not only make tracking down the container harder but also prevent the owner from accessing those processes through other APIs.

Related to this proposal, it was suggested to introduce a new sysctl which would allow the system administrator to prevent any id transitions to overflow ids, i.e. a process would not be able to set{g,u}id() to the overflow {g,u}id. A distribution can then decide to select specific overflow ids (akin to a system id) and set them via the already existing /proc/sys/kernel/overflow{g,u}id sysctl interfaces. This further increases the security that isolated user namespaces provide.

2. How is filesystem access in isolated user namespaces handled?

(This is basically the problem outlined in P5.) There were quite a few proposals pitched by Andy and some others and it would be difficult to summarize them all here, especially since a few of them were rather rudimentary sketches.
Once the YouTube video of the Hackroom session is up people can listen to it in more detail.

The first consensus reached seemed to be to decouple isolated user namespaces from shiftfs. The idea is to rely solely on tmpfs and fuse at the beginning as filesystems which can be mounted inside isolated user namespaces and so would have proper ownership. For mount points that originate from outside the namespace, everything will show as the overflow ids and access would be restricted to the most restrictive permission bits for any path that can be accessed.

### Additional Requirements

Sargun pointed out that they make use of NFSv4 both id mapped and non-id mapped. Different id mappings between different filesystems in NFS is not part of their use-case currently and so it is fine if the ids are passed through as is. He additionally pointed out that they would like to be able to use the idmapper tool in such isolated containers. This tool maps a given process id to the highest user id available. It seems that all of these use-cases would work with the current setup.

It was proposed that for NFS an alternative solution should be considered, namely making it possible to mount NFS inside of a user namespace. This work would need to be done by someone well-versed in modern NFS.

### Action Items

The following consensus seemed to have been reached by the end of the session:

1. Fixing rlimits in user namespaces such that one namespace cannot affect another. This was identified as problem P6 above. During the session it seemed that Eric intended to investigate this.

2. Prototyping switching the kernel uid/gid types to 64 bit, allowing for a hidden 32 bit identifier and a fully usable 32 bit uid/gid range for the container. The consensus seemed to have been to implement a first version of this and do performance testing to see what the performance impact of this change would be. Aleksa Sarai and Christian Brauner stated they were interested in taking on this work jointly.

3.
Find a way to allow setgroups() in a user namespace while keeping in mind the case of groups used for negative access control. This was suggested by Josh Triplett and Geoffrey Thomas. Their idea was to investigate adding a prctl() to allow setgroups() to be called in a user namespace at the cost of restricting paths to the most restrictive permission. So if something is 0707 it needs to be treated as if it's 0000 even though the caller is not in its owning group which is used for negative access control (how these new semantics will interact with ACLs will also need to be looked into).

4. Add optional enforcement that the overflow uid/gid as set through sysctl cannot be used as a regular uid/gid on the system, which will allow userspace to disambiguate credential ids which are unmapped from the “nobody” user (which is actually used by distributions). It seemed that this idea was pitched by Geoffrey Thomas.

Special thanks to Stéphane and Aleksa for corrections and additions!

Thanks!
Christian
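
For illustration, the id layout described in the Proposal section and the cross-namespace view from proposal 1.1 can be sketched as follows. This is a minimal model in Python, not kernel code, and nothing like it was shown in the session. The function names, the explicit `viewer_is_ancestor` flag, and the simplification that the owning id is already expressed in the viewer's namespace are all hypothetical. The overflow id uses the kernel default of 65534.

```python
# Illustrative model only: hypothetical names, not an actual kernel interface.
ID_BITS = 32
ID_MASK = (1 << ID_BITS) - 1
OVERFLOW_ID = 65534  # default value of /proc/sys/kernel/overflow{g,u}id


def make_kernel_id(token: int, uid: int) -> int:
    """Pack a namespace token (upper 32 bits) and a uid (lower 32 bits)."""
    if not 0 <= token <= ID_MASK or not 0 <= uid <= ID_MASK:
        raise ValueError("token and uid must each fit in 32 bits")
    return (token << ID_BITS) | uid


def userspace_id(kernel_id: int) -> int:
    """What userspace inside the namespace sees: the lower 32 bits only."""
    return kernel_id & ID_MASK


def ns_token(kernel_id: int) -> int:
    """The in-kernel namespace token stored in the upper 32 bits."""
    return kernel_id >> ID_BITS


def id_seen_from(kernel_id: int, viewer_token: int, owner_id: int,
                 viewer_is_ancestor: bool) -> int:
    """Proposal 1.1: ancestors see the owning id, unrelated namespaces
    see the overflow id, and the namespace itself sees the plain uid."""
    if ns_token(kernel_id) == viewer_token:
        return userspace_id(kernel_id)
    if viewer_is_ancestor:
        return owner_id  # assumed to be already mapped into the viewer
    return OVERFLOW_ID


# Two isolated namespaces can both use uid 1000 without colliding in-kernel.
a = make_kernel_id(1, 1000)
b = make_kernel_id(2, 1000)
print(a != b, userspace_id(a), userspace_id(b))
```

Under proposal 1.2 the `viewer_is_ancestor` branch would simply be dropped, so every other namespace would see the overflow id.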