Re: [PATCH RESEND v3 bpf-next 00/14] BPF token

Toke Høiland-Jørgensen <toke@xxxxxxxxxx> · Wed, 05 Jul 2023 01:20:22 +0200

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes:

> On Thu, Jun 29, 2023 at 4:15 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>>
>> Andrii Nakryiko <andrii@xxxxxxxxxx> writes:
>>
>> > This patch set introduces new BPF object, BPF token, which allows to delegate
>> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
>> > systemd or any other container manager) to a *trusted* unprivileged
>> > application. Trust is the key here. This functionality is not about allowing
>> > unconditional unprivileged BPF usage. Establishing trust, though, is
>> > completely up to the discretion of respective privileged application that
>> > would create a BPF token, as different production setups can and do achieve it
>> > through a combination of different means (signing, LSM, code reviews, etc),
>> > and it's undesirable and infeasible for kernel to enforce any particular way
>> > of validating trustworthiness of particular process.
>> >
>> > The main motivation for BPF token is a desire to enable containerized
>> > BPF applications to be used together with user namespaces. This is currently
>> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
>> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
>> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
>> > arbitrary memory, and it's impossible to ensure that they only read memory of
>> > processes belonging to any given namespace. This means that it's impossible to
>> > have namespace-aware CAP_BPF capability, and as such another mechanism to
>> > allow safe usage of BPF functionality is necessary. BPF token and delegation
>> > of it to a trusted unprivileged applications is such mechanism. Kernel makes
>> > no assumption about what "trusted" constitutes in any particular case, and
>> > it's up to specific privileged applications and their surrounding
>> > infrastructure to decide that. What kernel provides is a set of APIs to create
>> > and tune BPF token, and pass it around to privileged BPF commands that are
>> > creating new BPF objects like BPF programs, BPF maps, etc.
>>
>> So a colleague pointed out today that the Seccomp Notify functionality
>> would be a way to achieve your stated goal of allowing unprivileged
>> containers to (selectively) perform bpf() syscall operations. Christian
>> Brauner has a pretty nice writeup of the functionality here:
>> https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development
>>
>> In fact he even mentions allowing unprivileged access to bpf() as a
>> possible use case (in the second-to-last paragraph).
>>
>> AFAICT this would enable your use case without adding any new kernel
>> functionality or changing the BPF-using applications, while allowing the
>> privileged userspace daemon to make case-by-case decisions on each
>> operation instead of granting blanket capabilities (which is my main
>> objection to the token proposal, as we discussed on the last iteration
>> of the series).
>
> It's not "blanket" capabilities. You control types or maps and
> programs that could be created. And again, CAP_SYS_ADMIN guarded.
> Please, don't give CAP_SYS_ADMIN/root permissions to applications you
> can't be sure won't do something stupid and blame kernel API for it.

Right, I didn't mean "blanket" in the sense of "permission to do
anything on the system"; I do get that you can restrict which subset of
functionality you grant. However, *within* that subset, it's a blanket
permission grant. I.e., you can't issue a token that grants a *specific*
application permission to load a *specific* BPF program - you can only
grant a general "load any program" permission that can be used by anyone
who possesses the token.

I guess we could in principle extend the token mechanism to allow this,
but the kernel doesn't seem like the right place to implement such a
fine-grained policy engine...

> After all, the root process can setuid() any file and make it run with
> elevated permissions, right? Doesn't get more "blanket" than that.

Which is exactly why setuid binaries are not generally how we implement
security delegation these days. So I don't think designing a new
mechanism this way is a good idea.

>> So I'm curious whether you considered this as an alternative to
>> BPF_TOKEN? And if so, what your reason was for rejecting it?
>>
>
> Yes, I'm aware, Christian has a follow up short blog post specifically
> for using this for proxying BPF from privileged process ([0]).
>
> So, in short, I think it's not a good generic solution. It's very
> fragile and high-maintenance. It's still proxying BPF UAPI (except
> application does preserve illusion of using BPF syscall, yes, that
> part is good) with all the implications: needing to replicate all of
> UAPI (fetching all those FDs from another process, following all the
> pointers from another process' memory, etc), and also writing back all
> the correct things (into another process' memory): log content,
> log_true_size (out param), any other output parameters.

Right, OK, that bit does sound pretty tedious (although I'll note that
there are people who are trying to make all this generally more
palatable[0]).

However, all that tediousness could be avoided while still retaining the
model of blocking the syscall and asking a userspace policy daemon to
supply a verdict. This could even be done using the same token
mechanism: instead of attaching a permission to the token itself, just
make it an opaque identifier. Then, when a syscall is made that contains
the token, block it and send a notification to user space and use the
verdict that comes back in place of the token "value". The notification
could go through the same file descriptor (using read/write or an ioctl,
restricted to CAP_SYS_ADMIN), or it could be a separate one that is
returned alongside it on TOKEN_CREATE. The notification could include
all of the syscall args or a subset, depending on the command, but the
kernel can ensure there are no TOCTOU races, and no need for the policy
daemon to go poking into other another process' namespace.

Actually, using this model I don't think we would even strictly speaking
need the explicit token FD to be included by the calling application
inside the container at all? I.e., if the system policy daemon could
just instruct the kernel "please delegate all permission decisions for
this user namespace to me", it could - so to speak - issue tokens on
demand as each call is made, instead of ahead of time. Which would both
enable the policy daemon to make specific usage decisions, and wouldn't
require any change needed to the applications using BPF inside the
container (not even to include the BPF token FD).

-Toke