Replying to a couple emails at once...

On Mon, May 6, 2024 at 12:14 AM Aleksa Sarai <cyphar@xxxxxxxxxx> wrote:
>
> On 2024-04-28, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> > > On Apr 26, 2024, at 6:39 AM, Stas Sergeev <stsp2@xxxxxxxxx> wrote:
> > > This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall.
> > > It is needed to perform an open operation with the creds that were in
> > > effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW
> > > flag. This allows the process to pre-open some dirs and switch eUID
> > > (and other UIDs/GIDs) to the less-privileged user, while still retaining
> > > the possibility to open/create files within the pre-opened directory set.
> > >
> >
> > I’ve been contemplating this, and I want to propose a different solution.
> >
> > First, the problem Stas is solving is quite narrow and doesn’t
> > actually need kernel support: if I want to write a user program that
> > sandboxes itself, I have at least three solutions already. I can make
> > a userns and a mountns; I can use landlock; and I can have a separate
> > process that brokers filesystem access using SCM_RIGHTS.
> >
> > But what if I want to run a container, where the container can access
> > a specific host directory, and the contained application is not aware
> > of the exact technology being used? I recently started using
> > containers in anger in a production setting, and “anger” was
> > definitely the right word: binding part of a filesystem in is
> > *miserable*. Getting the DAC rules right is nasty. LSMs are worse.
> > Podman’s “bind,relabel” feature is IMO utterly disgusting. I think I
> > actually gave up on making one of my use cases work on a Fedora
> > system.
> >
> > Here’s what I wanted to do, logically, in production: pick a host
> > directory, pick a host *principal* (UID, GID, label, etc), and have
> > the *entire container* access the directory as that principal.
> > This is what happens automatically if I run the whole container as a userns
> > with only a single UID mapped, but I don’t really want to do that for
> > a whole variety and of reasons.
> >
> > So maybe reimagining Stas’ feature a bit can actually solve this
> > problem. Instead of a special dirfd, what if there was a special
> > subtree (in the sense of open_tree) that captures a set of creds and
> > does all opens inside the subtree using those creds?
> >
> > This isn’t a fully formed proposal, but I *think* it should be
> > generally fairly safe for even an unprivileged user to clone a subtree
> > with a specific flag set to do this. Maybe a capability would be
> > needed (CAP_CAPTURE_CREDS?), but it would be nice to allow delegating
> > this to a daemon if a privilege is needed, and getting the API right
> > might be a bit tricky.
>
> Tying this to an actual mount rather than a file handle sounds like a
> more plausible proposal than OA2_CRED_INHERIT, but it just seems that
> this is going to re-create all of the work that went into id-mapped
> mounts but with the extra-special step of making the generic VFS
> permissions no longer work normally (unless the idea is that everything
> would pretend to be owned by current_fsuid()?).

I was assuming that the owner uid and gid would be shown to stat, etc.
as usual. But the permission checks would be done against the
captured creds.

>
> IMHO it also isn't enough to just make open work, you need to make all
> operations work (which leads to a non-trivial amount of
> filesystem-specific handling), which is just idmapped mounts. A lot of
> work was put into making sure that is safe, and collapsing owners seems
> like it will cause a lot of headaches.
>
> I also find it somewhat amusing that this proposal is to basically give
> up on multi-user permissions for this one directory tree because it's
> too annoying to deal with. In that case, isn't chmod 777 a simpler
> solution?
> (I'm being a bit flippant, of course there is a difference,
> but the net result is that all users in the container would have the
> same permissions with all of the fun issues that implies.)
>
> In short, AFAICS idmapped mounts pretty much solve this problem (minus
> the ability to collapse users, which I suspect is not a good idea in
> general)?
>

With my kernel hat on, maybe I agree. But with my *user* hat on, I
think I pretty strongly disagree. Look, idmap is lousy for
unprivileged use:

$ install -m 0700 -d test_directory
$ echo 'hi there' >test_directory/file
$ podman run -it --rm --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim]
# cat /tmp/file
hi there

<-- Hey, look, this kind of works!

# setpriv --reuid=1 ls /tmp
ls: cannot open directory '/tmp': Permission denied

<-- Gee, thanks, Linux!

Obviously this is a made-up example. But it's quite analogous to a
real example. Suppose I want to make a directory that will contain
some MySQL data. I don't want to share this directory with anyone
else, so I set its mode to 0700. Then I want to fire up an
unprivileged MySQL container, so I build or download it, bind my
directory to /var/lib/mysql, and run it. I don't need to think about
UIDs or anything because it's 2024 and containers just work. Okay, I
need to setenforce 0 because I'm on Fedora and SELinux makes
absolutely no sense in a container world, but I can live with that.

Except that it doesn't work! Because unless I want to manually futz
with the idmaps to get mysql to have access to the directory inside
the container, only *root* gets to get in. But I bet that even
futzing with the idmap doesn't work, because software like mysql
often expects that root *and* a user can access data. And some
software even does privilege separation and uses more than one UID.

So I want a way to give *an entire container* access to a directory.
Classic UNIX DAC is just *wrong* for this use case.
Maybe idmaps could learn a way to squash multiple ids down to one. Or
maybe something like my silly credential-capturing mount proposal
could work. But the status quo is not actually amazing IMO.

I haven't looked at the idmap implementation nearly enough to have any
opinion as to whether squashing UIDs is practical or whether there's
any sensible way to specify it in the configuration.

> On Apr 29, 2024, at 2:12 AM, Christian Brauner <brauner@xxxxxxxxxx> wrote:
>
> Nowadays it's extremely simple due tue open_tree(OPEN_TREE_CLONE) and
> move_mount(). I rewrote the bind-mount logic in systemd based on that
> and util-linux uses that as well now.
> https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html
>

Yep, I remember that.

>> Podman’s “bind,relabel” feature is IMO utterly disgusting. I think I
>> actually gave up on making one of my use cases work on a Fedora
>> system.
>>
>> Here’s what I wanted to do, logically, in production: pick a host
>> directory, pick a host *principal* (UID, GID, label, etc), and have
>> the *entire container* access the directory as that principal. This is
>> what happens automatically if I run the whole container as a userns
>> with only a single UID mapped, but I don’t really want to do that for
>> a whole variety and of reasons.
>
> You're describing idmapped mounts for the most part which are upstream
> and are used in exactly that way by a lot of userspace.
>

See above...

>>
>> So maybe reimagining Stas’ feature a bit can actually solve this
>> problem. Instead of a special dirfd, what if there was a special
>> subtree (in the sense of open_tree) that captures a set of creds and
>> does all opens inside the subtree using those creds?
>
> That would mean override creds in the VFS layer when accessing a
> specific subtree which is a terrible idea imho. Not just because it will
> quickly become a potential dos when you do that with a lot of subtrees
> it will also have complex interactions with overlayfs.

I was deliberately talking about semantics, not implementation. This
may well be impossible to implement straightforwardly.
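
To make the semantic difference concrete, here is a toy Python model
(not kernel code, and not how the VFS actually implements any of this)
of the two behaviours discussed above: a range-based id mapping, where
only the container id that maps to the directory owner passes a 0700
check, versus a hypothetical "captured creds" subtree, where every
caller is checked as one fixed principal. All names and numbers
(HOST_UID = 1000, the 100000 subuid range, the helper functions) are
assumptions made up for illustration. The model ignores capabilities,
ACLs, and LSMs, and folds the mount and user-namespace translations
into a single step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inode:
    uid: int   # owner as seen on the host
    gid: int   # group as seen on the host
    mode: int  # permission bits, e.g. 0o700


def may_read(inode, uid, gid):
    """Plain owner/group/other read check (no capabilities, no ACLs)."""
    if uid == inode.uid:
        return bool(inode.mode & 0o400)
    if gid == inode.gid:
        return bool(inode.mode & 0o040)
    return bool(inode.mode & 0o004)


def map_id(idmap, container_id):
    """Translate an id through (container_start, host_start, count) ranges."""
    for c_start, h_start, count in idmap:
        if c_start <= container_id < c_start + count:
            return h_start + (container_id - c_start)
    return 65534  # unmapped ids fall back to the overflow id


# Assumed setup: host user 1000 owns a 0700 directory; container id 0
# maps to 1000 and container ids 1..65535 map to a subuid range.
HOST_UID = 1000
IDMAP = [(0, HOST_UID, 1), (1, 100000, 65535)]
DATA_DIR = Inode(uid=HOST_UID, gid=HOST_UID, mode=0o700)


def access_with_idmap(container_uid):
    """Check access as whatever host id the caller maps to."""
    host_id = map_id(IDMAP, container_uid)
    return may_read(DATA_DIR, host_id, host_id)


def access_with_captured_creds(container_uid, captured=(HOST_UID, HOST_UID)):
    """Hypothetical semantics: the caller's id is ignored on purpose and
    the check is done against the creds captured with the subtree."""
    return may_read(DATA_DIR, *captured)


if __name__ == "__main__":
    for cuid in (0, 1):
        print(cuid, access_with_idmap(cuid), access_with_captured_creds(cuid))
    # prints:
    # 0 True True
    # 1 False True
```

In this model the owner shown by stat would be unchanged; only the
identity used for the permission check differs, which is the
distinction drawn earlier in this mail.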