On 5/18/2024 5:20 AM, Serge Hallyn wrote:
> On Fri, May 17, 2024 at 10:53:24AM -0700, Casey Schaufler wrote:
>> On 5/17/2024 4:42 AM, Jonathan Calmels wrote:
>>>>>> On Thu May 16, 2024 at 10:07 PM EEST, Casey Schaufler wrote:
>>>>>>> I suggest that adding a capability set for user namespaces is a bad idea:
>>>>>>> - It is in no way obvious what problem it solves
>>>>>>> - It is not obvious how it solves any problem
>>>>>>> - The capability mechanism has not been popular, and relying on a
>>>>>>> community (e.g. container developers) to embrace it based on this
>>>>>>> enhancement is a recipe for failure
>>>>>>> - Capabilities are already more complicated than modern developers
>>>>>>> want to deal with. Adding another, special purpose set, is going
>>>>>>> to make them even more difficult to use.
>>> Sorry if the commit wasn't clear enough.
>> While, as others have pointed out, the commit description left
>> much to be desired, that isn't the biggest problem with the change
>> you're proposing.
>>
>>> Basically:
>>>
>>> - Today user namespaces grant full capabilities.
>> Of course they do. I have been following the use of capabilities
>> in Linux since before they were implemented. The uptake has been
>> disappointing in all use cases.
>>
>>> This behavior is often abused to attack various kernel subsystems.
>> Yes. The problems of a single, all powerful root privilege scheme are
>> well documented.
>>
>>> Only option
>> Hardly.
>>
>>> is to disable them altogether which breaks a lot of
>>> userspace stuff.
>> Updating userspace components to behave properly in a capabilities
>> environment has never been a popular activity, but is the right way
>> to address this issue. And before you start on the "no one can do that,
>> it's too hard", I'll point out that multiple UNIX systems supported
>> rootless, all capabilities based systems back in the day.
>>
>>> This goes against the least privilege principle.
>> If you're going to run userspace that *requires* privilege, you have
>> to have a way to *allow* privilege. If the userspace insists on a root
>> based privilege model, you're stuck supporting it. Regardless of your
>> principles.
> Casey,
>
> I might be wrong, but I think you're misreading this patchset. It is not
> about limiting capabilities in the init user ns at all. It's about limiting
> the capabilities which a process in a child userns can get.

I do understand that. My objection is not to the intent, but to the approach.
Adding a capability set to the general mechanism in support of a limited,
specific use case seems wrong to me. I would rather see a mechanism in userns
to limit the capabilities in a user namespace than a mechanism in capabilities
that is specific to user namespaces.

> Any unprivileged task can create a new userns, and get a process with
> all capabilities in that namespace. Always. User namespaces were a
> great success in that we can do this without any resulting privilege
> against host owned resources. The unaddressed issue is the expanded
> kernel code surface area.

An option to clone() then, to limit the capabilities available?
I honestly can't recall if that has been suggested elsewhere, and apologize
if it's already been dismissed as a stoopid idea.

>
> You say, above, (quoting out of place here)
>
>> Updating userspace components to behave properly in a capabilities
>> environment has never been a popular activity, but is the right way
>> to address this issue. And before you start on the "no one can do that,
>> it's too hard", I'll point out that multiple UNIX systems supported
> He's not saying no one can do that. He's saying, correctly, that the
> kernel currently offers no way for userspace to do this limiting.
> His patchset offers two ways: one system wide capability mask (which applies
> only to non-initial user namespaces) and one per-process inherited one
> which - yay - userspace can use to limit what its children will be
> able to get if they unshare a user namespace.
>
>>> - It adds a new capability set.
>> Which is a really, really bad idea. The equation for calculating effective
>> privilege is already more complicated than userspace developers are generally
>> willing to put up with.
> This is somewhat true, but I think the semantics of what is proposed here are
> about as straightforward as you could hope for, and you can basically reason
> about them completely independently of the other sets. Only when reasoning
> about the correctness of this code do you need to consider the other sets. Not
> when administering a system.
>
> If you want root in a child user namespace to not have CAP_MAC_ADMIN, you drop
> it from your pU. Simple as that.
>
>>> This set dictates what capabilities are granted in namespaces (instead
>>> of always getting full caps).
>> I would not expect container developers to be eager to learn how to use
>> this facility.
> I'm a container developer, and I'm excited about it :)

OK, well, I'm wrong. It's happened before and will happen again.

>
>>> This brings namespaces in line with the rest of the system, user
>>> namespaces are no more "special".
>> I'm sorry, but this makes no sense to me whatsoever. You want to introduce
>> a capability set explicitly for namespaces in order to make them less
>> special?
> Yes, exactly.

Hmm. I can't say I buy that. It makes a whole lot more sense to me to
change userns than to change capabilities.

>
>> Maybe I'm just old and cranky.
> That's fine.
>
>>> They now work the same way as say a transition to root does with
>>> inheritable caps.
>> That needs some explanation.
>>
>>> - This isn't intended to be used by end users per se (although they could).
>>> This would be used at the same places where existing capabilities are
>>> used today (e.g. init system, pam, container runtime, browser
>>> sandbox), or by system administrators.
>> I understand that. It is for containers. Containers are not kernel entities.
> User namespaces are.
>
> This patch set provides userspace a way of limiting the kernel code exposed
> to untrusted children, which currently does not exist.

Yes, I understand. I would rather see a change to userns in support of a userns
specific need than a change to capabilities for a userns specific need.

>>> To give you some ideas of things you could do:
>>>
>>> # E.g. prevent alice from getting CAP_NET_ADMIN in user namespaces under SSH
>>> echo "auth optional pam_cap.so" >> /etc/pam.d/sshd
>>> echo "!cap_net_admin alice" >> /etc/security/capability.conf
>>>
>>> # E.g. prevent any Docker container from ever getting CAP_DAC_OVERRIDE
>>> systemd-run -p CapabilityBoundingSet=~CAP_DAC_OVERRIDE \
>>>     -p SecureBits=userns-strict-caps \
>>>     /usr/bin/dockerd
>>>
>>> # E.g. kernel could be vulnerable to CAP_SYS_RAWIO exploits
>>> # Prevent users from ever gaining it
>>> sysctl -w cap_bound_userns_mask=0x1fffffdffff
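
For reference, the mask in the last quoted example can be checked with a little arithmetic. This is only a minimal sketch: it assumes CAP_SYS_RAWIO is capability number 17 and that the kernel defines 41 capabilities (CAP_LAST_CAP = 40), as in the uapi headers current at the time of this thread. cap_bound_userns_mask is the sysctl proposed by the patch set, and the helper name userns_mask_without is hypothetical, used here for illustration only.

```python
# Capability numbers as found in include/uapi/linux/capability.h
# (assumed to match the kernel under discussion).
CAP_SYS_RAWIO = 17
CAP_LAST_CAP = 40  # highest capability number assumed here


def userns_mask_without(*dropped_caps: int) -> int:
    """Return a bitmask with every known capability set except dropped_caps.

    Hypothetical helper for illustration; not part of the patch set.
    """
    mask = (1 << (CAP_LAST_CAP + 1)) - 1  # bits 0..CAP_LAST_CAP all set
    for cap in dropped_caps:
        mask &= ~(1 << cap)  # clear the bit of each dropped capability
    return mask


mask = userns_mask_without(CAP_SYS_RAWIO)
print(hex(mask))  # 0x1fffffdffff, the value in the quoted sysctl example
```

Clearing bit 17 from the all-ones mask 0x1ffffffffff gives 0x1fffffdffff, which matches the value passed to sysctl above.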