Re: [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2()

Andy Lutomirski <luto@xxxxxxxxxxxxxx> · Sun, 28 Apr 2024 09:41:20 -0700

> On Apr 26, 2024, at 6:39 AM, Stas Sergeev <stsp2@xxxxxxxxx> wrote:
> This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall.
> It is needed to perform an open operation with the creds that were in
> effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW
> flag. This allows the process to pre-open some dirs and switch eUID
> (and other UIDs/GIDs) to the less-privileged user, while still retaining
> the possibility to open/create files within the pre-opened directory set.
>

I’ve been contemplating this, and I want to propose a different solution.

First, the problem Stas is solving is quite narrow and doesn’t
actually need kernel support: if I want to write a user program that
sandboxes itself, I have at least three solutions already.  I can make
a userns and a mountns; I can use landlock; and I can have a separate
process that brokers filesystem access using SCM_RIGHTS.
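
For reference, the third option (a broker handing out descriptors over
SCM_RIGHTS) can be sketched roughly as below. This is only an illustration
under stated assumptions, not anything from the patch set: the helper names
broker_open(), client_receive() and demo() are made up, both ends run in one
process over a socketpair for brevity, and a real broker would be a separate
privileged process that checks each requested path against a policy before
opening it. It relies on socket.send_fds()/socket.recv_fds(), which need
Python 3.9+ on a Unix system.

```python
import os
import socket
import tempfile


def broker_open(sock, path, flags=os.O_RDONLY):
    """Privileged side (hypothetical helper): open `path`, pass the fd over `sock`."""
    fd = os.open(path, flags)
    try:
        # send_fds() wraps sendmsg() with an SCM_RIGHTS control message.
        socket.send_fds(sock, [b"ok"], [fd])
    finally:
        # The receiver gets its own duplicate, so the broker's copy can go.
        os.close(fd)


def client_receive(sock):
    """Sandboxed side (hypothetical helper): receive one descriptor from the broker."""
    msg, fds, _msg_flags, _addr = socket.recv_fds(sock, 16, 1)
    if msg != b"ok" or len(fds) != 1:
        raise OSError("broker did not hand over a descriptor")
    return fds[0]


def demo():
    """Single-process demonstration; a real broker lives in another process."""
    broker_end, client_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        with tempfile.NamedTemporaryFile() as tmp:
            tmp.write(b"hello from the broker\n")
            tmp.flush()
            broker_open(broker_end, tmp.name)
            fd = client_receive(client_end)
        # The file has been unlinked by now, but the passed descriptor
        # still refers to the open file.
        with os.fdopen(fd, "rb") as f:
            return f.read()
    finally:
        broker_end.close()
        client_end.close()
```

The client never needs permission to open the path itself; access is decided
by the creds of whoever called open(), which is the same property the
proposals in this thread try to get without a helper process.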

But what if I want to run a container, where the container can access
a specific host directory, and the contained application is not aware
of the exact technology being used?  I recently started using
containers in anger in a production setting, and “anger” was
definitely the right word: binding part of a filesystem in is
*miserable*.  Getting the DAC rules right is nasty.  LSMs are worse.
Podman’s “bind,relabel” feature is IMO utterly disgusting.  I think I
actually gave up on making one of my use cases work on a Fedora
system.

Here’s what I wanted to do, logically, in production: pick a host
directory, pick a host *principal* (UID, GID, label, etc), and have
the *entire container* access the directory as that principal. This is
what happens automatically if I run the whole container as a userns
with only a single UID mapped, but I don’t really want to do that for
a whole variety of reasons.

So maybe reimagining Stas’ feature a bit can actually solve this
problem.  Instead of a special dirfd, what if there was a special
subtree (in the sense of open_tree) that captures a set of creds and
does all opens inside the subtree using those creds?

This isn’t a fully formed proposal, but I *think* it should be
generally fairly safe for even an unprivileged user to clone a subtree
with a specific flag set to do this. Maybe a capability would be
needed (CAP_CAPTURE_CREDS?), but it would be nice to allow delegating
this to a daemon if a privilege is needed, and getting the API right
might be a bit tricky.

Then two different things could be done:

1. The subtree could be used unmounted or via /proc magic links. This
would be for programs that are aware of this interface.

2. The subtree could be mounted, and accesses through the mount would
use the captured creds.

(Hmm. What would a new open_tree() pointing at this special subtree do?)

With all this done, if userspace wired it up, a container user could
do something like:

--bind-capture-creds source=dest

And the contained program would access source *as the user who started
the container*, and this would just work without relabeling or
fiddling with owner uids or gids or ACLs, and it would continue to
work even if the container has multiple dynamically allocated subuids
mapped (e.g. one for “root” and one for the actual application).

Bonus points for the ability to revoke the creds in an already opened
subtree. Or even for the creds to automatically revoke themselves when
the opener exits (or maybe when a specific cred-pinning fd goes away).

(This should work for single files as well as for directories.)

New LSM hooks or extensions of existing hooks might be needed to make
LSMs comfortable with this.

What do you all think?