Re: [PATCH v2 bpf-next 00/18] BPF token

Christian Brauner <brauner@xxxxxxxxxx> · Wed, 14 Jun 2023 11:39:28 +0200

On Wed, Jun 14, 2023 at 02:23:02AM +0200, Djalal Harouni wrote:
> On Tue, Jun 13, 2023 at 12:27 AM Andrii Nakryiko
> <andrii.nakryiko@xxxxxxxxx> wrote:
> >
> > On Mon, Jun 12, 2023 at 5:02 AM Djalal Harouni <tixxdz@xxxxxxxxx> wrote:
> > >
> > > On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko
> > > <andrii.nakryiko@xxxxxxxxx> wrote:
> > > >
> > > > On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@xxxxxxxxx> wrote:
> > > > >
> > > > > Hi Andrii,
> > > > >
> > > > > On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > ...
> > > > > > creating new BPF objects like BPF programs, BPF maps, etc.
> > > > >
> > > > > Is there a reason for coupling this only with the userns?
> > > >
> > > > There is no coupling. Without userns it is at least possible to grant
> > > > CAP_BPF and other capabilities from init ns. With user namespace that
> > > > becomes impossible.
> > >
> > > But these are not the same: delegate full cap vs delegate an fd mask?
> >
> > What FD mask are we talking about here? I don't recall us talking
> > about any FD masks, so this one is a bit confusing without more
> > context.
> 
> Ah err, sorry yes referring to fd token (which I assumed is a mask of
> allowed operations or something like that).
> 
> So I want the possibility to delegate the fd token in the init userns.
> 
> > >
> > > One can argue unprivileged in init userns is the same privileged in
> > > nested userns
> > > Getting to delegate fd in init userns, then in nested ones seems logical...
> >
> > Again, sorry, I'm not following. Can you please elaborate what you mean?
> 
> I mean can we use the fd token in the init user namespace too? not
> only in the nested user namespaces but in the first one? Sorry I
> didn't check the code.
> 
> 
> > >
> > > > > The "trusted unprivileged" assumed by systemd can be in init userns?
> > > >
> > > > It doesn't have to be systemd, but yes, BPF token can be created only
> > > > when you have CAP_SYS_ADMIN in init ns. It's in line with restrictions
> > > > on a bunch of other bpf() syscall commands (like GET_FD_BY_ID family
> > > > of commands).
> > >
> > > I'm more into getting fd delegation work also in the first init userns...
> > >
> > > I can't understand why it's not possible or doable?
> > >
> >
> > I don't know what you are proposing, as I mentioned above, so it's
> > hard to answer this question.
> >
> 
> 
> > > > >
> > > > >
> > > > > > Previous attempt at addressing this very same problem ([0]) attempted to
> > > > > > utilize authoritative LSM approach, but was conclusively rejected by upstream
> > > > > > LSM maintainers. BPF token concept is not changing anything about LSM
> > > > > > approach, but can be combined with LSM hooks for very fine-grained security
> > > > > > policy. Some ideas about making BPF token more convenient to use with LSM (in
> > > > > > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF
> > > > > > 2023 presentation ([1]). E.g., an ability to specify user-provided data
> > > > > > (context), which in combination with BPF LSM would allow implementing a very
> > > > > > dynamic and fine-granular custom security policies on top of BPF token. In the
> > > > > > interest of minimizing API surface area discussions this is going to be
> > > > > > added in follow up patches, as it's not essential to the fundamental concept
> > > > > > of delegatable BPF token.
> > > > > >
> > > > > > It should be noted that BPF token is conceptually quite similar to the idea of
> > > > > > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> > > > > > difference is the idea of using virtual anon_inode file to hold BPF token and
> > > > > > allowing multiple independent instances of them, each with its own set of
> > > > > > restrictions. BPF pinning solves the problem of exposing such BPF token
> > > > > > through file system (BPF FS, in this case) for cases where transferring FDs
> > > > > > over Unix domain sockets is not convenient. And also, crucially, BPF token
> > > > > > approach is not using any special stateful task-scoped flags. Instead, bpf()
> > > > >
> > > > > What's the use case for transfering over unix domain sockets?
> > > >
> > > > I'm not sure I understand the question. Unix domain socket
> > > > (specifically its SCM_RIGHTS ancillary message) allows to transfer
> > > > files between processes, which is one way to pass BPF object (like
> > > > prog/map/link, and now token). BPF FS is the other one. In practice
> > > > it's usually BPF FS, but there is no presumption about how file
> > > > reference is transferred.
> > >
> > > Got it.
> > >
> > > IIRC SCM_RIGHTS and SCM_CREDENTIALS are translated into the receiving
> > > userns, no ?
> > >
> > > I assume such which allows to set up things in a hierarchical way...
> > >
> > > If I set up the environment to lock things down the line, I find it
> > > strange if a received fd would allow me to do more things than what
> > > was planned when I created the environment: namespaces, mounts, etc
> > >
> > > I think you have to add the owning userns context to the fd or
> > > "token", and on the receiving part if the current userns is the same
> > > or a nested one of the current userns hierarchy then allow bpf
> > > operation, otherwise fail with -EACCESS or something similar...
> > >
> >
> > I think I mentioned problems with namespacing BPF itself. It's just
> > fundamentally impossible due to a system-wide nature of BPF. So we can
> > pretend to somehow attach/restrict BPF token to some namespace, but it
> > still allows BPF programs to peek at any kernel state or user-space
> > process.
> 
> I'm not referring to namespacing BPF, but about the same token that
> can fly between containers...
> More or less problems mentioned by Casey
> https://lore.kernel.org/bpf/20230602150011.1657856-19-andrii@xxxxxxxxxx/T/#m005dfd937e4fff7a8cc35036f0ce38281f01e823
> 
> I think that a token or the fd should be part of the bpffs and should
> not be shared between containers or crosse namespaces by default
> without control... hence the suggested protection:
> https://lore.kernel.org/bpf/CAEf4BzazbMqAh_Nj_geKNLshxT+4NXOCd-LkZ+sRKsbZAJ1tUw@xxxxxxxxxxxxxx/T/#m217d041d9ef9e02b598d5f0e1ff61043aeae57fd
> 
> 
> > So I'd rather us not pretend we can do something that we actually
> > cannot enforce.
> 
> Actually it is to protect against accidental token sharing or abuse...
> so completely different things.
> 
> 
> > >
> > > > >
> > > > > Will BPF token translation happen if you cross the different namespaces?
> > > >
> > > > What does BPF token translation mean specifically? Currently it's a
> > > > very simple kernel object with refcnt and a few flags, so there is
> > > > nothing to translate?
> > >
> > > Please see above comment about the owning userns context
> > >
> > > > >
> > > > > If the token is pinned into different bpffs, will the token share the
> > > > > same context?
> > > >
> > > > So I was planning to allow a user process creating a BPF token to
> > > > specify custom user-provided data (context). This is not in this patch
> > > > set, but is it what you are asking about?
> > >
> > > Exactly, define what you can access inside the container... this would
> > > align with Andy's suggestion "making BPF behave sensibly in that
> > > container seems like it should also be necessary." I do agree on this.
> > >
> >
> > I don't know what Andy's suggestion actually is (as I honestly can't
> > make out what your proposal is, sorry; you guys are not making it easy
> > on me by being pretty vague and nonspecific). But see above about
> > pretending to contain BPF within a container. There is no such thing.
> > BPF is system-wide.
> 
> Sorry about that, I can quickly put: you may restrict types of bpf
> programs, you may disable or nop probes if they are running without a
> process context, if the triggered probe is owned by root by specific
> uid? if the process is under a specific cgroup hierarchy etc... Are
> the above possible?
> 
> 
> > > Again I think LSM and bpf+lsm should have the final word on this too...
> > >
> >
> > Yes, I also think that having LSM on top is beneficial. But not a
> > strict requirement and more or less orthogonal.
> 
> I do think there should be LSM hooks to tighten this, as LSMs have
> more context outside of BPF...
> 
> 
> > >
> > > > Regardless, pinning BPF object in BPF FS is just basically bumping a
> > > > refcnt and exposes that object in a way that can be looked up through
> > > > file system path (using bpf() syscall's BPF_OBJ_GET command).
> > > > Underlying object isn't cloned or copied, it's exactly the same object
> > > > with the same shared internal state.
> > >
> > > This is the part I also find strange, I can understand pinning a bpf
> > > program, map, etc, but an fd that gives some access rights should be
> > > part of the filesystem from the start, I don't get the extra pinning.
> >
> > BPF pinning of BPF token is optional. Everything still works without
> > any BPF FS mount at all. It's an FD, BPF FS is just one of the means
> > to pass FD to another process. I actually don't see why coupling BPF
> > FS and BPF token is simpler.
> 
> I think it's better the other way around since bpffs is per super
> block and separate mount then it is already solved, you just get that
> special fd from the fs and pass it...
> 
> 
> > Now, BPF token is a kernel object, with its own state. It has an FD
> > associated with it. It can be passed around and provided as an
> > argument to bpf() syscall. In that sense it's just like BPF
> > prog/map/link, just another BPF object.
> >
> > > Also it seems bpffs is per superblock mount so why not allow
> > > privileged to mount bpffs with the corresponding information, then
> > > privileged can open the fd, set it up and pass it down the line when
> > > executing the main program?  or even allow unprivileged to open it on
> > > bpffs with some restrictive conditions?
> > >
> > > Then it would be the business of the privileged to bind mount bpffs in
> > > some other places, share it, etc
> >
> > How is this fundamentally different from BPF token pinning by
> > *privileged* process? Except we are not conflating BPF FS as a way to
> > pin/get many different BPF objects with BPF token itself. In both
> > cases it's up to privileged process to set up sharing of BPF token
> > appropriately.
> 
> I'm not convinced about the use case of sharing BPF tokens between
> containers or services...
> 
> Every container or service has its own separate bpffs, what's the
> point of pinning a shared token created by a different container
> compared to mounting separate bpffs with an fd token prepared to be
> used for that specific container?
> 
> Then the container/service can delegate it to child processes, etc...
> but sharing between containers and crossing user namespaces, mount
> namespaces of such containers where bpffs is already separate in that
> context? I don't see the point, and it just opens the room to token
> misuse...
> 
> 
> > >
> > > Having the fd or "token" that gives access rights pinned in two
> > > separate bpffs mounts seems too much, it crosses namespaces (mount,
> > > userns etc), environments setup by privileged...
> >
> > See above, there is nothing namespaceable about BPF itself, and BPF
> > token as well. If some production setup benefits from pinning one BPF
> > token in multiple places, I don't see the problem with that.
> >
> > >
> > > I would just make it per bpffs mount and that's it, nothing more. If a
> > > program wants to bind mount it somewhere else then it's not a bpf
> > > problem.
> >
> > And if some application wants to pin BPF token, why would that be BPF
> > subsystem's problem as well?
> 
> The credentials, capabilities, keyring, different namespaces, etc are
> all attached to the owning user namespace, if the BPF subsystem goes
> its own way and creates a token to split up CAP_BPF without following
> that model, then it's definitely a BPF subsystem problem...  I don't
> recommend that.
> 
> Feels it's going more of a system-wide approach opening BPF
> functionality where ultimately it clashes with the argument: delegate
> a subset of BPF functionality to a *trusted* unprivileged application.
> My reading of delegation is within a container/service hierarchy
> nothing more.

You're making the exact arguments that Lennart, Aleksa, and I have been
making in the LSFMM presentation about this topic. It's even recorded:

https://youtu.be/4CCRTWEZLpw?t=1546

So we fully agree with you here.