On Tue, Jul 4, 2023 at 2:52 AM Christian Brauner <brauner@xxxxxxxxxx> wrote:
>
> On Fri, Jun 30, 2023 at 01:15:47AM +0200, Toke Høiland-Jørgensen wrote:
> > Andrii Nakryiko <andrii@xxxxxxxxxx> writes:
> >
> > > This patch set introduces new BPF object, BPF token, which allows to delegate
> > > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > > systemd or any other container manager) to a *trusted* unprivileged
> > > application. Trust is the key here. This functionality is not about allowing
> > > unconditional unprivileged BPF usage. Establishing trust, though, is
> > > completely up to the discretion of respective privileged application that
> > > would create a BPF token, as different production setups can and do achieve it
> > > through a combination of different means (signing, LSM, code reviews, etc),
> > > and it's undesirable and infeasible for kernel to enforce any particular way
> > > of validating trustworthiness of particular process.
> > >
> > > The main motivation for BPF token is a desire to enable containerized
> > > BPF applications to be used together with user namespaces. This is currently
> > > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> > > arbitrary memory, and it's impossible to ensure that they only read memory of
> > > processes belonging to any given namespace. This means that it's impossible to
> > > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> > > no assumption about what "trusted" constitutes in any particular case, and
> > > it's up to specific privileged applications and their surrounding
> > > infrastructure to decide that. What kernel provides is a set of APIs to create
> > > and tune BPF token, and pass it around to privileged BPF commands that are
> > > creating new BPF objects like BPF programs, BPF maps, etc.
> >
> > So a colleague pointed out today that the Seccomp Notify functionality
> > would be a way to achieve your stated goal of allowing unprivileged
> > containers to (selectively) perform bpf() syscall operations. Christian
> > Brauner has a pretty nice writeup of the functionality here:
> > https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development
>
> I'm amazed you read this. :)
> The seccomp notifier comes with a lot of caveats. I think it would be
> impractical if not infeasible to handle bpf() delegation.

Thanks for confirming my hunch. And yeah, I read a bunch of posts from
your blog. The one about new mount APIs was especially useful given how
little documentation I could find on them otherwise :)

>
> >
> > In fact he even mentions allowing unprivileged access to bpf() as a
> > possible use case (in the second-to-last paragraph).
>
> Yeah, I tried to work around a userspace regression with the
> introduction of the cgroup v2 devices controller.