On Mon, Jun 12, 2023 at 2:45 PM Dave Tucker <datucker@xxxxxxxxxx> wrote:
>
>
>
> > On 8 Jun 2023, at 00:53, Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> >
> > This patch set introduces new BPF object, BPF token, which allows to delegate
> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > systemd or any other container manager) to a *trusted* unprivileged
> > application. Trust is the key here. This functionality is not about allowing
> > unconditional unprivileged BPF usage. Establishing trust, though, is
> > completely up to the discretion of respective privileged application that
> > would create a BPF token.
>
>
> Hello! Author of a bpfd[1] here.
>
> > The main motivation for BPF token is a desire to enable containerized
> > BPF applications to be used together with user namespaces. This is currently
> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> > arbitrary memory, and it's impossible to ensure that they only read memory of
> > processes belonging to any given namespace. This means that it's impossible to
> > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> > no assumption about what "trusted" constitutes in any particular case, and
> > it's up to specific privileged applications and their surrounding
> > infrastructure to decide that. What kernel provides is a set of APIs to create
> > and tune BPF token, and pass it around to privileged BPF commands that are
> > creating new BPF objects like BPF programs, BPF maps, etc.
>
> You could do that… but the problem is created due to the pattern of having a
> single binary that is responsible for:
>
> - Loading and attaching the BPF program in question
> - Interacting with maps
>
> Let’s set aside some of the other fun concerns of eBPF in containers:
> - Requiring mounting of vmlinux, bpffs, traces etc…
> - How fs permissions on host translate into permissions in containers
>
> While your proposal lets you grant a subset of CAP_BPF to some other process,
> which I imagine could also be done with SELinux, it doesn’t stop you from needing
> other required permissions for attaching tracing programs in such an
> environment.
>
> For example, say container A wants to attach a uprobe to a process in container B.
> Container A needs to be able to nsenter into container B’s pidns in order for attachment
> to succeed… but then what I can do with CAP_BPF is the least of my concerns since
> I’d wager I’d need to mount `/proc` from the host in container A + have elevated privileges
> much scarier than CAP_BPF in the first place.
>
> If you move “Loading and attaching” away to somewhere else (i.e a daemon like bpfd)
> then with recent kernels your container workload should be fine to run entirely unprivileged,
> or worst case with only CAP_BPF since all you need to do is read/write maps.
>
> Policy control - which process can request to load programs that monitor which other
> processes - would happen within this system daemon and you wouldn’t need tokens.
>
> Since it’s easy enough to do this in userspace, I’d be strongly against adding more
> complexity into BPF to support this usecase.
For some cases the complexity goes the other way: BPF programs are by design small, can be loaded/unloaded dynamically, and work on their own, which makes them easily adaptable to dynamic workloads. Not all BPF use cases are the same. Stuffing *everything* together and doing round trips between the main container and the container that transfers, loads and attaches the BPF programs raises the question: what's the advantage?
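
To make the comparison concrete, here is a rough, untested sketch of what the workload side could look like with a token, following the flow from the cover letter: the privileged manager creates and tunes the token once and hands the fd to the unprivileged, user-namespaced workload, which then creates its maps (and, the same way, loads its programs) directly via the bpf() syscall instead of round-tripping through a loader daemon. The map_token_fd and BPF_F_TOKEN_FD names below are my assumptions about the uapi this series adds and may not match it exactly:

/* Sketch only: map_token_fd / BPF_F_TOKEN_FD are assumed names for the
 * uapi additions in this series and may differ from the actual patches.
 */
#include <linux/bpf.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
	return syscall(__NR_bpf, cmd, attr, size);
}

/* token_fd was created and tuned by the privileged container manager and
 * handed to this unprivileged workload (e.g. as an inherited fd) */
static int create_map_with_token(int token_fd)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.map_type = BPF_MAP_TYPE_HASH;
	attr.key_size = 4;
	attr.value_size = 8;
	attr.max_entries = 128;
	attr.map_flags = BPF_F_TOKEN_FD;	/* assumed flag name */
	attr.map_token_fd = token_fd;		/* assumed field name */

	/* the kernel checks the token instead of requiring CAP_BPF, so the
	 * workload creates its own maps (and could load its own programs
	 * the same way) without bouncing through a loader daemon */
	return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
}

int main(int argc, char **argv)
{
	int token_fd, map_fd;

	/* the manager passes the token as an inherited fd and tells this
	 * process its number via argv */
	if (argc < 2)
		return 1;
	token_fd = atoi(argv[1]);

	map_fd = create_map_with_token(token_fd);
	printf("map_fd = %d\n", map_fd);
	return map_fd < 0 ? 1 : 0;
}

The only cross-container interaction left is the one-time token handoff; the load/attach path stays local to the workload.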