On Sun, Jun 09, 2024 at 03:43:33AM -0700, Jonathan Calmels wrote:
> This patch series introduces a new user namespace capability set, as
> well as some plumbing around it (i.e. sysctl, secbit, lsm support).
>
> First patch goes over the motivations for this as well as prior art.
>
> In summary, while user namespaces are a great success today in that they
> avoid running a lot of code as root, they also expand the attack surface
> of the kernel substantially which is often abused by attackers.
> Methods exist to limit the creation of such namespaces [1], however,
> application developers often need to assume that user namespaces are
> available for various tasks such as sandboxing. Thus, instead of
> restricting the creation of user namespaces, we offer ways for userspace
> to limit the capabilities granted to them.
>
> Why a new capability set and not something specific to the userns (e.g.
> ioctl_ns)?
>
>   1. We can't really expect userspace to patch every single callsite
>      and opt-in this new security mechanism.
>
>   2. We don't necessarily want policies enforced at said callsites.
>      For example a service like systemd-machined or a PAM session need to
>      be able to place restrictions on any namespace spawned under it.
>
>   3. We would need to come up with inheritance rules, querying
>      capabilities, etc. At this point we're just reinventing capability
>      sets.
>
>   4. We can easily define interactions between capability sets, thus
>      helping with adoption (patch 2 is an example of this)
>
> Some examples of how this could be leveraged in userspace:
>
>   - Prevent user from getting CAP_NET_ADMIN in user namespaces under SSH:
>       echo "auth optional pam_cap.so" >> /etc/pam.d/sshd
>       echo "!cap_net_admin $USER" >> /etc/security/capability.conf
>       capsh --secbits=$((1 << 8)) -- -c /usr/sbin/sshd
>
>   - Prevent containers from ever getting CAP_DAC_OVERRIDE:
>       systemd-run -p CapabilityBoundingSet=~CAP_DAC_OVERRIDE \
>                   -p SecureBits=userns-strict-caps \
>                   /usr/bin/dockerd
>       systemd-run -p UserNSCapabilities=~CAP_DAC_OVERRIDE \
>                   /usr/bin/incusd
>
>   - Kernel could be vulnerable to CAP_SYS_RAWIO exploits, prevent it:
>       sysctl -w cap_bound_userns_mask=0x1fffffdffff
>
>   - Drop CAP_SYS_ADMIN for this shell and all the user namespaces below it:
>       bwrap --unshare-user --cap-drop CAP_SYS_ADMIN /bin/sh
>

Where are the tests for this patchset?  I see you updated the bpf tests for the
bpf lsm bits, but there's nothing to validate this new behavior or exercise the
new ioctl you've added.

Thanks,

Josef