On Sun, Apr 28, 2024 at 09:41:20AM -0700, Andy Lutomirski wrote:
> > On Apr 26, 2024, at 6:39 AM, Stas Sergeev <stsp2@xxxxxxxxx> wrote:
> > This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall.
> > It is needed to perform an open operation with the creds that were in
> > effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW
> > flag. This allows the process to pre-open some dirs and switch eUID
> > (and other UIDs/GIDs) to the less-privileged user, while still retaining
> > the possibility to open/create files within the pre-opened directory set.
> >
>
> I’ve been contemplating this, and I want to propose a different solution.
>
> First, the problem Stas is solving is quite narrow and doesn’t
> actually need kernel support: if I want to write a user program that
> sandboxes itself, I have at least three solutions already. I can make
> a userns and a mountns; I can use landlock; and I can have a separate
> process that brokers filesystem access using SCM_RIGHTS.
>
> But what if I want to run a container, where the container can access
> a specific host directory, and the contained application is not aware
> of the exact technology being used? I recently started using
> containers in anger in a production setting, and “anger” was
> definitely the right word: binding part of a filesystem in is
> *miserable*. Getting the DAC rules right is nasty. LSMs are worse.

Nowadays it's extremely simple due to open_tree(OPEN_TREE_CLONE) and
move_mount(). I rewrote the bind-mount logic in systemd based on that,
and util-linux uses that as well now.

https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html

> Podman’s “bind,relabel” feature is IMO utterly disgusting. I think I
> actually gave up on making one of my use cases work on a Fedora
> system.
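
For reference, the open_tree(OPEN_TREE_CLONE) plus move_mount() approach
mentioned above might look roughly like the sketch below. The helper name
bind_mount_detached() is made up for illustration and is not taken from
systemd or util-linux. The raw syscall() interface is used so the sketch
does not depend on a particular glibc version, and it assumes kernel
headers recent enough to define SYS_open_tree and SYS_move_mount.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>        /* AT_FDCWD, O_CLOEXEC */
#include <linux/mount.h>  /* OPEN_TREE_CLONE, MOVE_MOUNT_F_EMPTY_PATH */
#include <sys/syscall.h>  /* SYS_open_tree, SYS_move_mount */
#include <unistd.h>       /* syscall(), close() */

#ifndef AT_RECURSIVE
#define AT_RECURSIVE 0x8000 /* value from the kernel uapi headers */
#endif

/*
 * Hypothetical helper: bind-mount the tree at `src` onto `dst`.
 * Returns 0 on success, -1 on failure with errno set.
 */
int bind_mount_detached(const char *src, const char *dst)
{
    int fd, ret, saved;

    /* Create a detached, recursive copy of the mount tree at src. */
    fd = (int)syscall(SYS_open_tree, AT_FDCWD, src,
                      OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC | AT_RECURSIVE);
    if (fd < 0)
        return -1;

    /* Attach the detached tree, referred to by fd itself, at dst. */
    ret = (int)syscall(SYS_move_mount, fd, "", AT_FDCWD, dst,
                       MOVE_MOUNT_F_EMPTY_PATH);

    saved = errno;  /* keep the move_mount() error across close() */
    close(fd);
    errno = saved;
    return ret;
}
```

A caller would use it as bind_mount_detached("/srv/data", "/mnt/data")
(both paths are placeholders). It needs CAP_SYS_ADMIN in the user namespace
that owns the caller's mount namespace. Because the detached tree is just a
file descriptor until move_mount() runs, it can also be attached after
switching into a different mount namespace.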
>
> Here’s what I wanted to do, logically, in production: pick a host
> directory, pick a host *principal* (UID, GID, label, etc), and have
> the *entire container* access the directory as that principal. This is
> what happens automatically if I run the whole container as a userns
> with only a single UID mapped, but I don’t really want to do that for
> a whole variety of reasons.

You're describing idmapped mounts for the most part, which are upstream
and are used in exactly that way by a lot of userspace.

>
> So maybe reimagining Stas’ feature a bit can actually solve this
> problem. Instead of a special dirfd, what if there was a special
> subtree (in the sense of open_tree) that captures a set of creds and
> does all opens inside the subtree using those creds?

That would mean overriding creds in the VFS layer when accessing a
specific subtree, which is a terrible idea imho. Not only will it quickly
become a potential DoS when you do that with a lot of subtrees, it will
also have complex interactions with overlayfs.

>
> This isn’t a fully formed proposal, but I *think* it should be
> generally fairly safe for even an unprivileged user to clone a subtree
> with a specific flag set to do this. Maybe a capability would be
> needed (CAP_CAPTURE_CREDS?), but it would be nice to allow delegating
> this to a daemon if a privilege is needed, and getting the API right
> might be a bit tricky.
>
> Then two different things could be done:
>
> 1. The subtree could be used unmounted or via /proc magic links. This
> would be for programs that are aware of this interface.
>
> 2. The subtree could be mounted, and accesses through the mount would
> use the captured creds.
>
> (Hmm. What would a new open_tree() pointing at this special subtree do?)
>
>
> With all this done, if userspace wired it up, a container user could
> do something like:
>
> --bind-capture-creds source=dest
>
> And the contained program would access source *as the user who started
> the container*, and this would just work without relabeling or
> fiddling with owner uids or gids or ACLs, and it would continue to
> work even if the container has multiple dynamically allocated subuids
> mapped (e.g. one for “root” and one for the actual application).
>
> Bonus points for the ability to revoke the creds in an already opened
> subtree. Or even for the creds to automatically revoke themselves when
> the opener exits (or maybe when a specific cred-pinning fd goes away).
>
> (This should work for single files as well as for directories.)
>
> New LSM hooks or extensions of existing hooks might be needed to make
> LSMs comfortable with this.
>
> What do you all think?

I think the problem you're describing is already mostly solved.
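
To make the idmapped-mount suggestion a little more concrete, here is a
minimal sketch of creating one with open_tree(), mount_setattr() and
move_mount(). The helper name idmapped_bind_mount() and its userns_path
parameter are assumptions made for illustration. The sketch also assumes
kernel headers from 5.12 or later (for struct mount_attr, MOUNT_ATTR_IDMAP
and SYS_mount_setattr) and a filesystem that supports idmapped mounts.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>        /* open(), AT_FDCWD, AT_EMPTY_PATH, O_CLOEXEC */
#include <linux/mount.h>  /* struct mount_attr, MOUNT_ATTR_IDMAP, ... */
#include <string.h>       /* memset() */
#include <sys/syscall.h>  /* SYS_open_tree, SYS_mount_setattr, ... */
#include <unistd.h>       /* syscall(), close() */

/*
 * Hypothetical helper: attach `src` at `dst` with the ID mapping of the
 * user namespace referred to by `userns_path` (for example
 * /proc/<pid>/ns/user of a process inside the container).
 * Returns 0 on success, -1 on failure with errno set.
 */
int idmapped_bind_mount(const char *src, const char *dst,
                        const char *userns_path)
{
    struct mount_attr attr;
    int userns_fd, tree_fd, saved, ret = -1;

    userns_fd = open(userns_path, O_RDONLY | O_CLOEXEC);
    if (userns_fd < 0)
        return -1;

    /* Detached copy of the mount at src; nothing is visible yet. */
    tree_fd = (int)syscall(SYS_open_tree, AT_FDCWD, src,
                           OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
    if (tree_fd < 0)
        goto out_userns;

    /* Apply the user namespace's ID mapping to the detached mount. */
    memset(&attr, 0, sizeof(attr));
    attr.attr_set = MOUNT_ATTR_IDMAP;
    attr.userns_fd = (__u64)userns_fd;
    if (syscall(SYS_mount_setattr, tree_fd, "", AT_EMPTY_PATH,
                &attr, sizeof(attr)) < 0)
        goto out_tree;

    /* Attach the now idmapped mount at dst. */
    ret = (int)syscall(SYS_move_mount, tree_fd, "", AT_FDCWD, dst,
                       MOVE_MOUNT_F_EMPTY_PATH);

out_tree:
    saved = errno;  /* preserve the first error across close() */
    close(tree_fd);
    errno = saved;
out_userns:
    saved = errno;
    close(userns_fd);
    errno = saved;
    return ret;
}
```

The ownership seen through the resulting mount follows the mapping of the
given user namespace, so the files on disk do not have to be chowned or
relabeled for the container. The caller needs CAP_SYS_ADMIN, and the user
namespace passed in has to carry the desired mapping.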