Re: [PATCH v2 bpf-next 00/18] BPF token

On Mon, Jun 12, 2023 at 5:45 AM Dave Tucker <datucker@xxxxxxxxxx> wrote:
>
>
>
> > On 8 Jun 2023, at 00:53, Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> >
> > This patch set introduces a new BPF object, the BPF token, which allows
> > delegating a subset of BPF functionality from a privileged system-wide daemon
> > (e.g., systemd or any other container manager) to a *trusted* unprivileged
> > application. Trust is the key here. This functionality is not about allowing
> > unconditional unprivileged BPF usage. Establishing trust, though, is
> > completely up to the discretion of the respective privileged application that
> > would create a BPF token.
>
>
> Hello! Author of bpfd[1] here.
>
> > The main motivation for the BPF token is a desire to enable containerized
> > BPF applications to be used together with user namespaces. This is currently
> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user(), can safely
> > read arbitrary memory, and it's impossible to ensure that they only read
> > memory of processes belonging to any given namespace. This means that it's
> > impossible to have a namespace-aware CAP_BPF capability, and as such another
> > mechanism to allow safe usage of BPF functionality is necessary. The BPF token
> > and its delegation to trusted unprivileged applications is such a mechanism.
> > The kernel makes no assumptions about what "trusted" constitutes in any
> > particular case; it's up to specific privileged applications and their
> > surrounding infrastructure to decide that. What the kernel provides is a set
> > of APIs to create and tune a BPF token, and to pass it to privileged BPF
> > commands that create new BPF objects like BPF programs, BPF maps, etc.
>
> You could do that… but the problem is created due to the pattern of having a
> single binary that is responsible for:
>
> - Loading and attaching the BPF program in question
> - Interacting with maps

It is a very desirable property to couple and deploy a user-space process and
its BPF programs/maps together and to manage their lifecycle directly. All of
Meta's production applications use this model. It allows for a simple and
reliable versioning story and lets BPF skeletons and BPF global variables be
used naturally. It makes BPF applications simple and easy to develop, debug,
version, deploy, and monitor.

It also couples BPF program attachment (link) with the lifetime of the
user-space process. So if the process crashes or restarts without clean
detachment, we don't end up with orphaned BPF programs and maps. We've had
pretty bad issues due to such orphaned programs, and that's why the whole BPF
link concept was formalized.

So it's actually a desirable approach in a real-world production setup.
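To make that model concrete, here's a minimal sketch of such an application
using a libbpf-generated skeleton; the "my_prog" program, its sample_rate
global variable, and its maps are hypothetical, purely for illustration:

/* assumes a hypothetical skeleton generated with
 * `bpftool gen skeleton my_prog.bpf.o > my_prog.skel.h`
 */
#include "my_prog.skel.h"

int main(void)
{
	struct my_prog_bpf *skel;
	int err;

	skel = my_prog_bpf__open_and_load();
	if (!skel)
		return 1;

	/* BPF global variables are plain struct members in the skeleton */
	skel->bss->sample_rate = 100;

	/* attaching creates bpf_links owned by this process; if the process
	 * exits or crashes without clean detachment, the kernel drops the
	 * links and unloads the programs automatically
	 */
	err = my_prog_bpf__attach(skel);
	if (err)
		goto cleanup;

	/* ... application logic, interacting with maps via skel->maps ... */

cleanup:
	my_prog_bpf__destroy(skel);
	return err ? 1 : 0;
}

The program, its maps, its global variables, and its attachments all live and
die with this one binary, which is exactly the coupling described above.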

>
> Let’s set aside some of the other fun concerns of eBPF in containers:
>  - Requiring mounting of vmlinux, bpffs, traces etc…
>  - How fs permissions on host translate into permissions in containers
>
> While your proposal lets you grant a subset of CAP_BPF to some other process,
> which I imagine could also be done with SELinux, it doesn't stop you from
> needing the other permissions required for attaching tracing programs in such
> an environment.

In some cases, yes, there are other parts of the kernel that would require
more work before they can be used this way. But a lot is already possible
within the bpf() syscall, including tracing stuff.

>
> For example, say container A wants to attach a uprobe to a process in container B.
> Container A needs to be able to nsenter into container B’s pidns in order for attachment
> to succeed… but then what I can do with CAP_BPF is the least of my concerns since
> I’d wager I’d need to mount `/proc` from the host in container A + have elevated privileges
> much scarier than CAP_BPF in the first place.

You'd wager, or you know for sure? I haven't tried, so I won't make any claims.

I do know, though, that our system-wide profiling agent (not running under a
user namespace) can attach to and profile namespaced applications running
inside containers without any nsenter.

But again, uprobe'ing some other container is just one of the possible use
cases. Even if some scenarios require more than the BPF token, that doesn't
invalidate the need for and usefulness of the BPF token.

>
> If you move "Loading and attaching" away to somewhere else (i.e., a daemon like bpfd)
> then with recent kernels your container workload should be fine to run entirely unprivileged,
> or worst case with only CAP_BPF since all you need to do is read/write maps.

Except we explicitly want to avoid the need for some external entity loading
BPF programs on our behalf, as I explained in my replies to Toke.

>
> Policy control - which process can request to load programs that monitor which other
> processes - would happen within this system daemon and you wouldn’t need tokens.

And we can do the same by controlling which containers/services are issued
BPF tokens. And in addition to that, LSM can be employed for more dynamic and
fine-granular control.

Doing this through a centralized daemon is one way of doing it, but it's not
universally the better way.
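To sketch how the token-based alternative could look (hedged: the token itself
is assumed to come from the token-creation bpf() command added by this series,
and the paths and helper names below are made up), a privileged manager could
pin a freshly created token into a container's BPF FS mount, and the workload
inside could fetch it with existing libbpf object APIs:

#include <bpf/bpf.h>

/* privileged side (container manager): expose the token inside the
 * container's own BPF FS mount; token creation itself is not shown
 */
int expose_token(int token_fd)
{
	return bpf_obj_pin(token_fd, "/sys/fs/bpf/container-a/token");
}

/* unprivileged, user-namespaced side: pick the token back up; passing it to
 * program/map creation is part of the follow-up libbpf plumbing, so only the
 * fetch is shown here
 */
int fetch_token(void)
{
	return bpf_obj_get("/sys/fs/bpf/container-a/token");
}

Which containers get a token pinned into their BPF FS mount is then purely a
policy decision of the manager, and LSM hooks can layer finer-grained checks
on top of that.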

>
> Since it's easy enough to do this in userspace, I'd be strongly against adding more
> complexity into BPF to support this use case.

I appreciate you trying to get more customers for bpfd; there is nothing wrong
with that. But this approach has major implications (good and bad) and is not
the most appropriate solution in a lot of cases and setups.

As for complexity: if you look at the code, you'll see that this is a
completely optional feature as far as the BPF UAPI goes, so your customers
won't need to care about the BPF token's existence if they are happy using the
bpfd solution.

>
> > A previous attempt at addressing this very same problem ([0]) utilized an
> > authoritative LSM approach, but was conclusively rejected by upstream LSM
> > maintainers. The BPF token concept doesn't change anything about the LSM
> > approach, but can be combined with LSM hooks for very fine-grained security
> > policies. Some ideas about making the BPF token more convenient to use with
> > LSM (in particular custom BPF LSM programs) were briefly described in a
> > recent LSF/MM/BPF 2023 presentation ([1]): e.g., an ability to specify
> > user-provided data (context), which in combination with BPF LSM would allow
> > implementing very dynamic and fine-granular custom security policies on top
> > of the BPF token. In the interest of minimizing the API surface area and
> > keeping discussions focused, this is going to be added in follow-up patches,
> > as it's not essential to the fundamental concept of a delegatable BPF token.
> >
> > It should be noted that the BPF token is conceptually quite similar to the
> > idea of a /dev/bpf device file, proposed by Song a while ago ([2]). The
> > biggest difference is the idea of using a virtual anon_inode file to hold the
> > BPF token and allowing multiple independent instances of it, each with its
> > own set of restrictions. BPF pinning solves the problem of exposing such a
> > BPF token through the file system (BPF FS, in this case) for cases where
> > transferring FDs over Unix domain sockets is not convenient. And also,
> > crucially, the BPF token approach does not use any special stateful
> > task-scoped flags. Instead, the bpf() syscall accepts token_fd parameters
> > explicitly for each relevant BPF command. This addresses the main concerns
> > brought up during the /dev/bpf discussion and fits better with the overall
> > BPF subsystem design.
> >
> > This patch set adds the basic minimum of functionality to make the BPF token
> > useful and to discuss its API and functionality. Currently only low-level
> > libbpf APIs support passing a BPF token around, which allows testing kernel
> > functionality but is for the most part not sufficient for real-world
> > applications, which typically use high-level libbpf APIs based on the
> > `struct bpf_object` type. This was done with the intent to limit the size of
> > the patch set and concentrate mostly on kernel-side changes. All the
> > necessary plumbing for libbpf will be sent as a separate follow-up patch set
> > once kernel support makes it upstream.
> >
> > Another part that should happen once the kernel-side BPF token is
> > established is a set of conventions between applications (e.g., systemd),
> > tools (e.g., bpftool), and libraries (e.g., libbpf) for sharing BPF tokens
> > through BPF FS at well-defined locations, allowing applications to take
> > advantage of this automatically without explicit code changes on the BPF
> > application's side. But I'd like to postpone this discussion until after the
> > BPF token concept lands.
> >
> >  [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@xxxxxxxxxx/
> >  [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf
> >  [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@xxxxxx/
> >
>
> - Dave
>
> [1]: https://github.com/bpfd-dev/bpfd
>




