On Fri, Jun 23, 2023 at 3:18 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>
> On 6/16/23 12:48 AM, Andrii Nakryiko wrote:
> > On Wed, Jun 14, 2023 at 2:39 AM Christian Brauner <brauner@xxxxxxxxxx> wrote:
> >> On Wed, Jun 14, 2023 at 02:23:02AM +0200, Djalal Harouni wrote:
> >>> On Tue, Jun 13, 2023 at 12:27 AM Andrii Nakryiko
> >>> <andrii.nakryiko@xxxxxxxxx> wrote:
> >>>> On Mon, Jun 12, 2023 at 5:02 AM Djalal Harouni <tixxdz@xxxxxxxxx> wrote:
> >>>>> On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko
> >>>>> <andrii.nakryiko@xxxxxxxxx> wrote:
> >>>>>> On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@xxxxxxxxx> wrote:
> >>>>>>>
> >>>>>>> Hi Andrii,
> >>>>>>>
> >>>>>>> On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> >>>>>>>>
> >>>>>>>> ...
> >>>>>>>> creating new BPF objects like BPF programs, BPF maps, etc.
> >>>>>>>
> >>>>>>> Is there a reason for coupling this only with the userns?
> >>>>>>
> >>>>>> There is no coupling. Without userns it is at least possible to grant
> >>>>>> CAP_BPF and other capabilities from init ns. With user namespace that
> >>>>>> becomes impossible.
> >>>>>
> >>>>> But these are not the same: delegate full cap vs delegate an fd mask?
> >>>>
> >>>> What FD mask are we talking about here? I don't recall us talking
> >>>> about any FD masks, so this one is a bit confusing without more
> >>>> context.
> >>>
> >>> Ah err, sorry yes referring to fd token (which I assumed is a mask of
> >>> allowed operations or something like that).
> >>>
> >>> So I want the possibility to delegate the fd token in the init userns.
> >>>
> >>>>>
> >>>>> One can argue unprivileged in init userns is the same privileged in
> >>>>> nested userns
> >>>>> Getting to delegate fd in init userns, then in nested ones seems logical...
> >>>>
> >>>> Again, sorry, I'm not following. Can you please elaborate what you mean?
> >>>
> >>> I mean can we use the fd token in the init user namespace too?
> >>> not only in the nested user namespaces but in the first one? Sorry I
> >>> didn't check the code.
> >>>
> >
> > [...]
> >
> >>>
> >>>>> Having the fd or "token" that gives access rights pinned in two
> >>>>> separate bpffs mounts seems too much, it crosses namespaces (mount,
> >>>>> userns etc), environments setup by privileged...
> >>>>
> >>>> See above, there is nothing namespaceable about BPF itself, and BPF
> >>>> token as well. If some production setup benefits from pinning one BPF
> >>>> token in multiple places, I don't see the problem with that.
> >>>>
> >>>>>
> >>>>> I would just make it per bpffs mount and that's it, nothing more. If a
> >>>>> program wants to bind mount it somewhere else then it's not a bpf
> >>>>> problem.
> >>>>
> >>>> And if some application wants to pin BPF token, why would that be BPF
> >>>> subsystem's problem as well?
> >>>
> >>> The credentials, capabilities, keyring, different namespaces, etc are
> >>> all attached to the owning user namespace, if the BPF subsystem goes
> >>> its own way and creates a token to split up CAP_BPF without following
> >>> that model, then it's definitely a BPF subsystem problem... I don't
> >>> recommend that.
> >>>
> >>> Feels it's going more of a system-wide approach opening BPF
> >>> functionality where ultimately it clashes with the argument: delegate
> >>> a subset of BPF functionality to a *trusted* unprivileged application.
> >>> My reading of delegation is within a container/service hierarchy
> >>> nothing more.
> >>
> >> You're making the exact arguments that Lennart, Aleksa, and I have been
> >> making in the LSFMM presentation about this topic. It's even recorded:
> >
> > Alright, so (I think) I get a pretty good feel now for what the main
> > concerns are, and why people are trying to push this to be an FS. And
> > it's not so much that BPF token grants bpf() syscall usage to unpriv
> > (but trusted) workloads or that BPF itself is not namespaceable.
> > The main worry is that BPF token, once issued, could be
> > illegally/uncontrollably passed outside of container, intentionally or
> > not. And by having this association with mount namespace (through BPF
> > FS) we automatically limit the sharing to only the container that has
> > access to that BPF FS.
>
> +1
>
> > So I agree that it makes sense to have this mount namespace
> > association, but I also would like to keep BPF token to be a separate
> > entity from BPF FS itself, and have the ability to have multiple
> > different BPF tokens exposed in a single BPF FS instance. I think the
> > latter is important.
> >
> > So how about this slight modification: when a BPF token is created
> > using BPF_TOKEN_CREATE command, the user has to provide an FD for
> > "associated" BPF FS instance (superblock). What that does is allow
> > BPF token to be created with BPF FS and/or mount namespace association
> > set in stone. After that BPF token can only be pinned in that BPF FS
> > instance and cannot leave the boundaries of that mount namespace
> > (specific details to be worked out, this is new area for me, so I'm
> > sorry if I'm missing nuances).
>
> Given bpffs is not a singleton and there can be multiple bpffs instances
> in a container, couldn't we make the token a special bpffs mount/mode?
> Something like single .token file in that mount (for example) which can
> be opened and the fd then passed along for prog/map creation? And given
> the multiple mounts, this also allows potentially for multiple tokens?
> In other words, this is already set up by the container manager when it
> sets up mounts rather than later, and the regular bpffs instance is sth
> separate from all that. Meaning, in your container you get the usual
> bpffs instance and then one or more special bpffs instances as tokens
> at different paths (and in future they could unlock different subset of
> bpf functionality for example).

Just from a technical point of view we could do that.
But I see a lot of value in keeping BPF token creation as part of BPF
syscall and its API. And the main issue, I believe, was not allowing
BPF token to escape the intended container, which should be more than
covered by BPF_TOKEN_CREATE pinning a token into provided BPF FS
instance and not allowing it to be repinned after that.

>
> Thanks,
> Daniel
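
To make the constraint discussed above concrete, here is a toy model of the intended semantics: the token is bound to one BPF FS instance at creation, pinned there in the same step, and every later pin attempt is refused. This is a sketch only, in Python rather than kernel code. The names BpfFs, BpfToken, token_create and token_pin and the errno values are all hypothetical and chosen for illustration; the actual details are still to be worked out.

```python
# Toy model only: NOT kernel code and NOT the real bpf() syscall API.
# All names and errno choices below are assumptions made for illustration.
import errno
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class BpfFs:
    """Stands in for one BPF FS instance (superblock)."""
    name: str
    pins: dict = field(default_factory=dict)  # path -> pinned object


@dataclass(eq=False)
class BpfToken:
    """Stands in for a BPF token bound to one BPF FS instance at creation."""
    bound_fs: BpfFs
    pinned_at: Optional[str] = None


def token_create(fs: BpfFs, path: str) -> BpfToken:
    """Hypothetical model of BPF_TOKEN_CREATE: create and pin in one step."""
    if path in fs.pins:
        raise OSError(errno.EEXIST, "path already in use")
    token = BpfToken(bound_fs=fs, pinned_at=path)
    fs.pins[path] = token
    return token


def token_pin(token: BpfToken, fs: BpfFs, path: str) -> None:
    """Model of a later pin attempt; this sketch refuses every one."""
    if fs is not token.bound_fs:
        # The token must not leave the BPF FS instance it was created for.
        raise OSError(errno.EXDEV, "token is bound to a different BPF FS instance")
    # Same instance, but the token was already pinned at creation time.
    raise OSError(errno.EPERM, "token is already pinned and cannot be repinned")


# Usage: the container manager creates the token in the container's BPF FS;
# attempts to pin it elsewhere, or to pin it a second time, both fail.
container_fs = BpfFs("container-bpffs")
other_fs = BpfFs("other-bpffs")
token = token_create(container_fs, "token")
for target in (other_fs, container_fs):
    try:
        token_pin(token, target, "copy")
    except OSError as err:
        print(target.name, errno.errorcode[err.errno])
```

The point of the model is that the association is fixed once, at creation, so no separate mechanism is needed later to stop the token from leaving the container.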