Re: [PATCH v2 bpf-next 00/18] BPF token

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Mon, 26 Jun 2023 15:08:49 -0700

On Fri, Jun 23, 2023 at 3:18 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>
> On 6/16/23 12:48 AM, Andrii Nakryiko wrote:
> > On Wed, Jun 14, 2023 at 2:39 AM Christian Brauner <brauner@xxxxxxxxxx> wrote:
> >> On Wed, Jun 14, 2023 at 02:23:02AM +0200, Djalal Harouni wrote:
> >>> On Tue, Jun 13, 2023 at 12:27 AM Andrii Nakryiko
> >>> <andrii.nakryiko@xxxxxxxxx> wrote:
> >>>> On Mon, Jun 12, 2023 at 5:02 AM Djalal Harouni <tixxdz@xxxxxxxxx> wrote:
> >>>>> On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko
> >>>>> <andrii.nakryiko@xxxxxxxxx> wrote:
> >>>>>> On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@xxxxxxxxx> wrote:
> >>>>>>>
> >>>>>>> Hi Andrii,
> >>>>>>>
> >>>>>>> On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> >>>>>>>>
> >>>>>>>> ...
> >>>>>>>> creating new BPF objects like BPF programs, BPF maps, etc.
> >>>>>>>
> >>>>>>> Is there a reason for coupling this only with the userns?
> >>>>>>
> >>>>>> There is no coupling. Without userns it is at least possible to grant
> >>>>>> CAP_BPF and other capabilities from init ns. With user namespace that
> >>>>>> becomes impossible.
> >>>>>
> >>>>> But these are not the same: delegate full cap vs delegate an fd mask?
> >>>>
> >>>> What FD mask are we talking about here? I don't recall us talking
> >>>> about any FD masks, so this one is a bit confusing without more
> >>>> context.
> >>>
> >>> Ah err, sorry yes referring to fd token (which I assumed is a mask of
> >>> allowed operations or something like that).
> >>>
> >>> So I want the possibility to delegate the fd token in the init userns.
> >>>
> >>>>>
> >>>>> One can argue unprivileged in init userns is the same privileged in
> >>>>> nested userns
> >>>>> Getting to delegate fd in init userns, then in nested ones seems logical...
> >>>>
> >>>> Again, sorry, I'm not following. Can you please elaborate what you mean?
> >>>
> >>> I mean can we use the fd token in the init user namespace too? not
> >>> only in the nested user namespaces but in the first one? Sorry I
> >>> didn't check the code.
> >>>
> >
> > [...]
> >
> >>>
> >>>>> Having the fd or "token" that gives access rights pinned in two
> >>>>> separate bpffs mounts seems too much, it crosses namespaces (mount,
> >>>>> userns etc), environments setup by privileged...
> >>>>
> >>>> See above, there is nothing namespaceable about BPF itself, and BPF
> >>>> token as well. If some production setup benefits from pinning one BPF
> >>>> token in multiple places, I don't see the problem with that.
> >>>>
> >>>>>
> >>>>> I would just make it per bpffs mount and that's it, nothing more. If a
> >>>>> program wants to bind mount it somewhere else then it's not a bpf
> >>>>> problem.
> >>>>
> >>>> And if some application wants to pin BPF token, why would that be BPF
> >>>> subsystem's problem as well?
> >>>
> >>> The credentials, capabilities, keyring, different namespaces, etc are
> >>> all attached to the owning user namespace, if the BPF subsystem goes
> >>> its own way and creates a token to split up CAP_BPF without following
> >>> that model, then it's definitely a BPF subsystem problem...  I don't
> >>> recommend that.
> >>>
> >>> Feels it's going more of a system-wide approach opening BPF
> >>> functionality where ultimately it clashes with the argument: delegate
> >>> a subset of BPF functionality to a *trusted* unprivileged application.
> >>> My reading of delegation is within a container/service hierarchy
> >>> nothing more.
> >>
> >> You're making the exact arguments that Lennart, Aleksa, and I have been
> >> making in the LSFMM presentation about this topic. It's even recorded:
> >
> > Alright, so (I think) I get a pretty good feel now for what the main
> > concerns are, and why people are trying to push this to be an FS. And
> > it's not so much that BPF token grants bpf() syscall usage to unpriv
> > (but trusted) workloads or that BPF itself is not namespaceable. The
> > main worry is that BPF token, once issues, could be
> > illegally/uncontrollably passed outside of container, intentionally or
> > not. And by having this association with mount namespace (through BPF
> > FS) we automatically limit the sharing to only contain that has access
> > to that BPF FS.
>
> +1
>
> > So I agree that it makes sense to have this mount namespace
> > association, but I also would like to keep BPF token to be a separate
> > entity from BPF FS itself, and have the ability to have multiple
> > different BPF tokens exposed in a single BPF FS instance. I think the
> > latter is important.
> >
> > So how about this slight modification: when a BPF token is created
> > using BPF_TOKEN_CREATE command, the user has to provide an FD for
> > "associated" BPF FS instance (superblock). What that does is allows
> > BPF token to be created with BPF FS and/or mount namespace association
> > set in stone. After that BPF token can only be pinned in that BPF FS
> > instance and cannot leave the boundaries of that mount namespace
> > (specific details to be worked out, this is new area for me, so I'm
> > sorry if I'm missing nuances).
>
> Given bpffs is not a singleton and there can be multiple bpffs instances
> in a container, couldn't we make the token a special bpffs mount/mode?
> Something like single .token file in that mount (for example) which can
> be opened and the fd then passed along for prog/map creation? And given
> the multiple mounts, this also allows potentially for multiple tokens?
> In other words, this is already set up by the container manager when it
> sets up mounts rather than later, and the regular bpffs instance is sth
> separate from all that. Meaning, in your container you get the usual
> bpffs instance and then one or more special bpffs instances as tokens
> at different paths (and in future they could unlock different subset of
> bpf functionality for example).

Just from a technical point of view we could do that. But I see a lot
of value in keeping BPF token creation as part of BPF syscall and its
API. And the main issue, I believe, was not allowing BPF token to
escape the intended container, which should be more than covered by
BPF_TOKEN_CREATE pinning a token into provided BPF FS instance and not
allowing it to be repinned after that.

>
> Thanks,
> Daniel