On Mon, Jun 12, 2023 at 2:45 PM Dave Tucker <datucker@xxxxxxxxxx> wrote:
>
>
>
> > On 8 Jun 2023, at 00:53, Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> >
> > This patch set introduces new BPF object, BPF token, which allows to delegate
> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > systemd or any other container manager) to a *trusted* unprivileged
> > application. Trust is the key here. This functionality is not about allowing
> > unconditional unprivileged BPF usage. Establishing trust, though, is
> > completely up to the discretion of respective privileged application that
> > would create a BPF token.
>
>
> Hello! Author of a bpfd[1] here.
>
> > The main motivation for BPF token is a desire to enable containerized
> > BPF applications to be used together with user namespaces. This is currently
> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> > arbitrary memory, and it's impossible to ensure that they only read memory of
> > processes belonging to any given namespace. This means that it's impossible to
> > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> > no assumption about what "trusted" constitutes in any particular case, and
> > it's up to specific privileged applications and their surrounding
> > infrastructure to decide that. What kernel provides is a set of APIs to create
> > and tune BPF token, and pass it around to privileged BPF commands that are
> > creating new BPF objects like BPF programs, BPF maps, etc.
>
> You could do that… but the problem is created due to the pattern of having a
> single binary that is responsible for:
>
> - Loading and attaching the BPF program in question
> - Interacting with maps
>
> Let’s set aside some of the other fun concerns of eBPF in containers:
> - Requiring mounting of vmlinux, bpffs, traces etc…
> - How fs permissions on host translate into permissions in containers
>
> While your proposal lets you grant a subset of CAP_BPF to some other process,
> which I imagine could also be done with SELinux, it doesn’t stop you from needing
> other required permissions for attaching tracing programs in such an
> environment.
>
> For example, say container A wants to attach a uprobe to a process in container B.
> Container A needs to be able to nsenter into container B’s pidns in order for attachment
> to succeed… but then what I can do with CAP_BPF is the least of my concerns since
> I’d wager I’d need to mount `/proc` from the host in container A + have elevated privileges
> much scarier than CAP_BPF in the first place.
>
> If you move “Loading and attaching” away to somewhere else (i.e a daemon like bpfd)
> then with recent kernels your container workload should be fine to run entirely unprivileged,
> or worst case with only CAP_BPF since all you need to do is read/write maps.
>
> Policy control - which process can request to load programs that monitor which other
> processes - would happen within this system daemon and you wouldn’t need tokens.
>
> Since it’s easy enough to do this in userspace, I’d be strongly against adding more
> complexity into BPF to support this usecase.
For some cases the complexity goes the other way: BPF programs are by design small, can be loaded/unloaded dynamically, and work on their own, which makes them easily adaptable to dynamic workloads. Not all BPF use cases are the same. Stuffing *everything* together and doing round trips between the main container and the container that transfers, loads and attaches the BPF programs raises the question: what's the advantage?
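
To make the comparison concrete, here is a rough, untested sketch of what the workload side could look like with a token, following the flow from the cover letter: the privileged manager creates and tunes the token once and hands the fd to the unprivileged, user-namespaced workload, which then creates its maps (and, the same way, loads its programs) directly via the bpf() syscall instead of round-tripping through a loader daemon. The map_token_fd and BPF_F_TOKEN_FD names below are my assumptions about the uapi this series adds and may not match it exactly:

/* Sketch only: map_token_fd / BPF_F_TOKEN_FD are assumed names for the
 * uapi additions in this series and may differ from the actual patches.
 */
#include <linux/bpf.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
	return syscall(__NR_bpf, cmd, attr, size);
}

/* token_fd was created and tuned by the privileged container manager and
 * handed to this unprivileged workload (e.g. as an inherited fd) */
static int create_map_with_token(int token_fd)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.map_type = BPF_MAP_TYPE_HASH;
	attr.key_size = 4;
	attr.value_size = 8;
	attr.max_entries = 128;
	attr.map_flags = BPF_F_TOKEN_FD;	/* assumed flag name */
	attr.map_token_fd = token_fd;		/* assumed field name */

	/* the kernel checks the token instead of requiring CAP_BPF, so the
	 * workload creates its own maps (and could load its own programs
	 * the same way) without bouncing through a loader daemon */
	return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
}

int main(int argc, char **argv)
{
	int token_fd, map_fd;

	/* the manager passes the token as an inherited fd and tells this
	 * process its number via argv */
	if (argc < 2)
		return 1;
	token_fd = atoi(argv[1]);

	map_fd = create_map_with_token(token_fd);
	printf("map_fd = %d\n", map_fd);
	return map_fd < 0 ? 1 : 0;
}

The only cross-container interaction left is the one-time token handoff; the load/attach path stays local to the workload.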