On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@xxxxxxxxx> wrote:
>
> Hi Andrii,
>
> On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> >
> > This patch set introduces new BPF object, BPF token, which allows to delegate
> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > systemd or any other container manager) to a *trusted* unprivileged
> > application. Trust is the key here. This functionality is not about allowing
> > unconditional unprivileged BPF usage. Establishing trust, though, is
> > completely up to the discretion of respective privileged application that
> > would create a BPF token.
> >
> > The main motivation for BPF token is a desire to enable containerized
> > BPF applications to be used together with user namespaces. This is currently
> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> > arbitrary memory, and it's impossible to ensure that they only read memory of
> > processes belonging to any given namespace. This means that it's impossible to
> > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> > no assumption about what "trusted" constitutes in any particular case, and
> > it's up to specific privileged applications and their surrounding
> > infrastructure to decide that. What kernel provides is a set of APIs to create
> > and tune BPF token, and pass it around to privileged BPF commands that are
> > creating new BPF objects like BPF programs, BPF maps, etc.
>
> Is there a reason for coupling this only with the userns?

There is no coupling.
Without userns it is at least possible to grant CAP_BPF and other
capabilities from init ns. With user namespace that becomes impossible.

> The "trusted unprivileged" assumed by systemd can be in init userns?

It doesn't have to be systemd, but yes, BPF token can be created only
when you have CAP_SYS_ADMIN in init ns. It's in line with restrictions
on a bunch of other bpf() syscall commands (like GET_FD_BY_ID family
of commands).

> >
> > Previous attempt at addressing this very same problem ([0]) attempted to
> > utilize authoritative LSM approach, but was conclusively rejected by upstream
> > LSM maintainers. BPF token concept is not changing anything about LSM
> > approach, but can be combined with LSM hooks for very fine-grained security
> > policy. Some ideas about making BPF token more convenient to use with LSM (in
> > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF
> > 2023 presentation ([1]). E.g., an ability to specify user-provided data
> > (context), which in combination with BPF LSM would allow implementing a very
> > dynamic and fine-granular custom security policies on top of BPF token. In the
> > interest of minimizing API surface area discussions this is going to be
> > added in follow up patches, as it's not essential to the fundamental concept
> > of delegatable BPF token.
> >
> > It should be noted that BPF token is conceptually quite similar to the idea of
> > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> > difference is the idea of using virtual anon_inode file to hold BPF token and
> > allowing multiple independent instances of them, each with its own set of
> > restrictions. BPF pinning solves the problem of exposing such BPF token
> > through file system (BPF FS, in this case) for cases where transferring FDs
> > over Unix domain sockets is not convenient. And also, crucially, BPF token
> > approach is not using any special stateful task-scoped flags.
> > Instead, bpf()
>
> What's the use case for transfering over unix domain sockets?

I'm not sure I understand the question. A Unix domain socket
(specifically its SCM_RIGHTS ancillary message) allows transferring
files between processes, which is one way to pass a BPF object (like
prog/map/link, and now token). BPF FS is the other one. In practice
it's usually BPF FS, but there is no presumption about how the file
reference is transferred.

>
> Will BPF token translation happen if you cross the different namespaces?

What does BPF token translation mean specifically? Currently it's a
very simple kernel object with refcnt and a few flags, so there is
nothing to translate?

>
> If the token is pinned into different bpffs, will the token share the
> same context?

So I was planning to allow a user process creating a BPF token to
specify custom user-provided data (context). This is not in this patch
set, but is it what you are asking about?

Regardless, pinning a BPF object in BPF FS basically just bumps a
refcnt and exposes that object in a way that can be looked up through
a file system path (using the bpf() syscall's BPF_OBJ_GET command).
The underlying object isn't cloned or copied, it's exactly the same
object with the same shared internal state.
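
To make the SCM_RIGHTS path mentioned above concrete, here is a minimal
sketch of passing an open file descriptor over a Unix domain socket.
It is only an illustration and not part of this patch set: the helper
names send_fd() and recv_fd() are made up for this example, and a plain
pipe FD stands in for a BPF object FD (prog/map/link/token), since the
transfer mechanism itself doesn't care what kind of file it is.

```python
import array
import os
import socket


def send_fd(sock: socket.socket, fd: int) -> None:
    """Send one open file descriptor over a Unix domain socket."""
    fds = array.array("i", [fd])
    # At least one byte of ordinary data has to accompany the
    # SCM_RIGHTS ancillary message on a stream socket.
    sock.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_fd(sock: socket.socket) -> int:
    """Receive one file descriptor sent by send_fd()."""
    fds = array.array("i")
    _, ancdata, _, _ = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any trailing partial integer in the ancillary buffer.
            fds.frombytes(data[: len(data) - (len(data) % fds.itemsize)])
            return fds[0]
    raise RuntimeError("no file descriptor received")


if __name__ == "__main__":
    # Hypothetical stand-in: a pipe instead of a real BPF object FD.
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()
    send_fd(left, write_end)
    received = recv_fd(right)
    # The received FD is a new descriptor for the same underlying file.
    os.write(received, b"ping")
    print(os.read(read_end, 4))
```

The receiver ends up with a new FD number that refers to the same
underlying file, which is the same sharing semantics described above
for pinning: nothing is cloned, only another reference is taken.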