Re: [PATCH v2 bpf-next 00/18] BPF token

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Fri, 9 Jun 2023 15:57:00 -0700

On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@xxxxxxxxx> wrote:
>
> Hi Andrii,
>
> On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> >
> > This patch set introduces new BPF object, BPF token, which allows to delegate
> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > systemd or any other container manager) to a *trusted* unprivileged
> > application. Trust is the key here. This functionality is not about allowing
> > unconditional unprivileged BPF usage. Establishing trust, though, is
> > completely up to the discretion of respective privileged application that
> > would create a BPF token.
> >
> > The main motivation for BPF token is a desire to enable containerized
> > BPF applications to be used together with user namespaces. This is currently
> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> > arbitrary memory, and it's impossible to ensure that they only read memory of
> > processes belonging to any given namespace. This means that it's impossible to
> > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> > no assumption about what "trusted" constitutes in any particular case, and
> > it's up to specific privileged applications and their surrounding
> > infrastructure to decide that. What kernel provides is a set of APIs to create
> > and tune BPF token, and pass it around to privileged BPF commands that are
> > creating new BPF objects like BPF programs, BPF maps, etc.
>
> Is there a reason for coupling this only with the userns?

There is no coupling. Without userns it is at least possible to grant
CAP_BPF and other capabilities from init ns. With user namespace that
becomes impossible.

> The "trusted unprivileged" assumed by systemd can be in init userns?

It doesn't have to be systemd, but yes, BPF token can be created only
when you have CAP_SYS_ADMIN in init ns. It's in line with restrictions
on a bunch of other bpf() syscall commands (like GET_FD_BY_ID family
of commands).

>
>
> > Previous attempt at addressing this very same problem ([0]) attempted to
> > utilize authoritative LSM approach, but was conclusively rejected by upstream
> > LSM maintainers. BPF token concept is not changing anything about LSM
> > approach, but can be combined with LSM hooks for very fine-grained security
> > policy. Some ideas about making BPF token more convenient to use with LSM (in
> > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF
> > 2023 presentation ([1]). E.g., an ability to specify user-provided data
> > (context), which in combination with BPF LSM would allow implementing a very
> > dynamic and fine-granular custom security policies on top of BPF token. In the
> > interest of minimizing API surface area discussions this is going to be
> > added in follow up patches, as it's not essential to the fundamental concept
> > of delegatable BPF token.
> >
> > It should be noted that BPF token is conceptually quite similar to the idea of
> > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> > difference is the idea of using virtual anon_inode file to hold BPF token and
> > allowing multiple independent instances of them, each with its own set of
> > restrictions. BPF pinning solves the problem of exposing such BPF token
> > through file system (BPF FS, in this case) for cases where transferring FDs
> > over Unix domain sockets is not convenient. And also, crucially, BPF token
> > approach is not using any special stateful task-scoped flags. Instead, bpf()
>
> What's the use case for transfering over unix domain sockets?

I'm not sure I understand the question. Unix domain socket
(specifically its SCM_RIGHTS ancillary message) allows to transfer
files between processes, which is one way to pass BPF object (like
prog/map/link, and now token). BPF FS is the other one. In practice
it's usually BPF FS, but there is no presumption about how file
reference is transferred.

>
> Will BPF token translation happen if you cross the different namespaces?

What does BPF token translation mean specifically? Currently it's a
very simple kernel object with refcnt and a few flags, so there is
nothing to translate?

>
> If the token is pinned into different bpffs, will the token share the
> same context?

So I was planning to allow a user process creating a BPF token to
specify custom user-provided data (context). This is not in this patch
set, but is it what you are asking about?

Regardless, pinning BPF object in BPF FS is just basically bumping a
refcnt and exposes that object in a way that can be looked up through
file system path (using bpf() syscall's BPF_OBJ_GET command).
Underlying object isn't cloned or copied, it's exactly the same object
with the same shared internal state.