On Wed, 2024-02-21 at 22:37 -0500, Kent Overstreet wrote:
> On Thu, Feb 22, 2024 at 01:33:14AM +0100, James Bottomley wrote:
> > On Wed, 2024-02-21 at 18:01 -0500, Kent Overstreet wrote:
> > > Strings are just arrays of integers, and anyways this stuff would
> > > be within helpers.
> >
> > Length limits and comparisons are the problem
>
> We'd be using qstrs for this, not c strings, so they really are
> equivalent to arrays for this purpose.
>
> > >
> > > But what you're not seeing is the beauty and simplicity of killing
> > > the mapping layer.
> >
> > Well, that's the problem: you don't for certain use cases. That's
> > what I've been trying to explain. For the fully unprivileged use
> > case, sure, it all works (as does the upper 32 bits proposal or the
> > integer array ... equally well.
> >
> > Once you're representing to the userns contained entity they have a
> > privileged admin that can write to the fsimage as an apparently
> > privileged user then the problems begin.
>
> In what sense?
>
> If they're in a userns and all their mounts are username mapped,
> that's completely fine from a userns POV; they can put a suid root
> binary into the fs image but when they mount that suid root will be
> suid to the root user of their userns.

If userns root can alter a suid root binary that's bind mounted from
the root namespace then that's a security violation because a user in
the root ns could use the altered binary to do a privilege escalation
attack.

> > > When usernames are strings all the way into the kernel, creating
> > > and switching to a new user is a single syscall. You can't do
> > > that if users are small integer identifiers to the kernel; you
> > > have to create a new entry in /etc/passwd or some equivalent, and
> > > that is strictly required in order to avoid collisions. Users
> > > also can't be ephemeral.
> > >
> > > To sketch out an example of how this would work, say we've got a
> > > new set_subuser() syscall and the username equivalent of chown().
> > >
> > > Now if we want to run firefox as a subuser, giving it access only
> > > to .local/state/firefox, we'd do the following sequence of
> > > syscalls within the start of the new firefox process:
> > >
> > > mkdir(".local/state/firefox");
> > > /* now owned by $USER.firefox */
> > > chown_subuser(".local/state/firefox", "firefox");
> > > set_subuser("firefox");
> > >
> > > If we want to guarantee uniqueness, we'd append a UUID to the
> > > subusername for the chown_subuser() call, and then for subsequent
> > > invocations read it with statx() (or subuser enabled equivalent)
> > > for the set_subuser() call.
> > >
> > > Now firefox is running in a sandbox, where it has no access to
> > > the rest of your home directory - unless explicitly granted with
> > > normal ACLs. And the sandbox requires no system configuration;
> > > rm -rfing the .local/state/firefox directory cleans everything up.
> > >
> > > And these trivially nest: Firefox itself wants to sandbox
> > > individual tabs from each other, so firefox could run each
> > > sub-process as a different subuser.
> > >
> > > This is dead easy compared to what we've been doing.
> >
> > The above is the unprivileged use case. It works, but it's not all
> > we have to support.
>
> There is only one root user, in the sense of _actual_ root -
> CAP_SYS_ADMIN and all that.

No, that's not correct. CAP_SYS_ADMIN is replaced by ns_capable() for
the user namespace. The creating entity of the userns becomes the ID
for which ns_capable() returns true. The whole goal of deprivileging
containers is to get the container root to seem like it has
CAP_SYS_ADMIN but in fact it's only ns_capable(). Certain features
which are allowed to the userns admin (like filesystem mappings of
inner root) are policy decisions the root namespace admin needs to
make.
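
To make the difference between a root-namespace capability check and
an ns_capable() style check concrete, here is a minimal userspace
sketch. It is a toy model loosely patterned on the capability walk
described above, not kernel code: every toy_* name, the struct layouts
and the uid value 1000 are assumptions made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model, NOT kernel code. All toy_* names are hypothetical. */
struct toy_userns {
	const struct toy_userns *parent; /* NULL for the root namespace */
	unsigned int owner;              /* creator's uid, in root-ns terms */
	int level;                       /* nesting depth, root is 0 */
};

struct toy_cred {
	const struct toy_userns *user_ns; /* namespace the task runs in */
	unsigned int euid;                /* effective uid, in root-ns terms */
	bool cap_sys_admin;               /* capability raised within user_ns */
};

/* Does cred hold the (single, modelled) capability over target? */
bool toy_ns_capable(const struct toy_cred *cred,
		    const struct toy_userns *target)
{
	const struct toy_userns *ns = target;

	for (;;) {
		/* Same namespace: the task's own capability bit decides. */
		if (ns == cred->user_ns)
			return cred->cap_sys_admin;
		/* Target is not below the task's namespace: never allowed. */
		if (ns->level <= cred->user_ns->level)
			return false;
		/* The creator of a child namespace is privileged over it. */
		if (ns->parent == cred->user_ns && ns->owner == cred->euid)
			return true;
		ns = ns->parent;
	}
}

/* Example scenario: uid 1000 creates a userns and runs "root" inside it. */
void toy_demo(void)
{
	struct toy_userns root = { NULL, 0, 0 };
	struct toy_userns child = { &root, 1000, 1 };
	struct toy_cred container_root = { &child, 1000, true };
	struct toy_cred creator = { &root, 1000, false };

	assert(toy_ns_capable(&container_root, &child)); /* looks like root inside */
	assert(!toy_ns_capable(&container_root, &root)); /* but not real root */
	assert(toy_ns_capable(&creator, &child));        /* creator owns the userns */
	assert(!toy_ns_capable(&creator, &root));        /* and is still unprivileged */
}
```

Calling toy_demo() from a main() exercises the scenario. In this model
the container root passes the check against its own namespace but
fails it against the root namespace, which is the sense in which it
only seems to have CAP_SYS_ADMIN.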

> > > > > > However, neither proposal would get us out of the problem
> > > > > > of mount mapping because we'd have to keep the filesystem
> > > > > > permission check on the owning uid unless told otherwise.
> > > > >
> > > > > Not sure I follow?
> > > >
> > > > Mounting a filesystem inside a userns can cause huge security
> > > > problems if we map fs root to inner root without the admin
> > > > blessing it. Think of binding /bin into the userns and then
> > > > altering one of the root owned binaries as inner root: if the
> > > > permission check passes, the change appears in system /bin.
> > >
> > > So with this proposal mount mapping becomes "map all users on
> > > this filesystem to subusers of username x". That's a much simpler
> > > mapping than mapping integer ranges to integer ranges, much
> > > easier to verify that there aren't accidental root escapes.
> >
> > That doesn't work for the privileged container run in unprivileged
> > userns containment use case because we need a mapping from inner to
> > outer root.
>
> I can't parse this. "Privileged container in an unprivileged
> containment"? Do you just mean a container that has root user (which
> is only root over that container, not the rest of the system, of
> course).

A privileged container is one that has services that run as root, yes.

> Any user is root over its subusers - so that works perfectly.

That's only one aspect of what container root might need to be able to
do.

> Or do you mean something else by "privileged container"? Do you mean
> a container that actually has CAP_SYS_ADMIN?

That's what docker currently does when it creates a privileged
container, yes. However, CAP_SYS_ADMIN is too powerful and can
trivially break containment meaning this isn't a workable solution for
container security. What we need is a container that can bring up
privileged services without root namespace CAP_SYS_ADMIN.

> > > > > And it wouldn't have to be administrator assigned.
> > > > > Some administrator assignment might be required for the
> > > > > username <-> 16 bit uid mapping, but if those mappings are
> > > > > ephemeral (i.e. if we get filesystems persistently storing
> > > > > usernames, which is easy enough with xattrs) then that just
> > > > > becomes "reserve x range of the 16 bit uid space for
> > > > > ephemeral translations".
> > > >
> > > > *if* the user names you're dealing with are all unprivileged.
> > > > When we have a mix of privileged and unprivileged users owning
> > > > the files, the problems begin.
> > >
> > > Yes, all subusers are unprivileged - only one username, the
> > > empty username (which we'd probably map to root) maps to existing
> > > uid 0.
> >
> > But, as I said above, that's only a subset of the use cases. The
> > equally big use case is figuring out how to run privileged
> > containers in a deprivileged mode and yet still allow them to
> > update images (and other things).
>
> If you're running in a userns, all your mounts get the same user
> mapping as your userns - where that usermapping is just prepending
> the username of the userns. That part is easy.

No, it's not. Any filesystem that's specific *only* to the container
can do an inner root to real root mapping. Any bind mount visible from
outside can't be allowed to do this because of the suid security issue
above. Determining this "visibility" is really hard, which is why it's
become a policy based mapping controlled by the root namespace admin.

> The big difficulty with letting them update images is that our
> current filesystems really aren't ready for the mounting of untrusted
> images - they're ~100k loc codebases each and the amount of hardening
> required is significant.
> I would hazard to guess that XFS is the furthest along in this
> respect (from all the screaming I hear from Darrick about syzkaller
> it sounds like they're taking this the most seriously) - but I would
> hesitate to depend on any of our filesystems to be secure in this
> respect, even my own - not until we get them rewritten in Rust...

This is a completely separate issue: whether we can allow an
unprivileged container to mount a fs image that might have been crafted
to attack the system. Most FS developers believe we'll never achieve
the point where any specially crafted fs image is safe to mount by an
unprivileged user so again whether the container is allowed to mount a
fs from a block or network device becomes a policy issue for the root
namespace admin rather than something we can globally allow.

James