Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes:

> On Thu, Jun 29, 2023 at 4:15 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>>
>> Andrii Nakryiko <andrii@xxxxxxxxxx> writes:
>>
>> > This patch set introduces new BPF object, BPF token, which allows to delegate
>> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
>> > systemd or any other container manager) to a *trusted* unprivileged
>> > application. Trust is the key here. This functionality is not about allowing
>> > unconditional unprivileged BPF usage. Establishing trust, though, is
>> > completely up to the discretion of respective privileged application that
>> > would create a BPF token, as different production setups can and do achieve it
>> > through a combination of different means (signing, LSM, code reviews, etc),
>> > and it's undesirable and infeasible for kernel to enforce any particular way
>> > of validating trustworthiness of particular process.
>> >
>> > The main motivation for BPF token is a desire to enable containerized
>> > BPF applications to be used together with user namespaces. This is currently
>> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
>> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
>> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
>> > arbitrary memory, and it's impossible to ensure that they only read memory of
>> > processes belonging to any given namespace. This means that it's impossible to
>> > have namespace-aware CAP_BPF capability, and as such another mechanism to
>> > allow safe usage of BPF functionality is necessary. BPF token and delegation
>> > of it to a trusted unprivileged applications is such mechanism. Kernel makes
>> > no assumption about what "trusted" constitutes in any particular case, and
>> > it's up to specific privileged applications and their surrounding
>> > infrastructure to decide that.
>> > What kernel provides is a set of APIs to create
>> > and tune BPF token, and pass it around to privileged BPF commands that are
>> > creating new BPF objects like BPF programs, BPF maps, etc.
>>
>> So a colleague pointed out today that the Seccomp Notify functionality
>> would be a way to achieve your stated goal of allowing unprivileged
>> containers to (selectively) perform bpf() syscall operations. Christian
>> Brauner has a pretty nice writeup of the functionality here:
>> https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development
>>
>> In fact he even mentions allowing unprivileged access to bpf() as a
>> possible use case (in the second-to-last paragraph).
>>
>> AFAICT this would enable your use case without adding any new kernel
>> functionality or changing the BPF-using applications, while allowing the
>> privileged userspace daemon to make case-by-case decisions on each
>> operation instead of granting blanket capabilities (which is my main
>> objection to the token proposal, as we discussed on the last iteration
>> of the series).
>
> It's not "blanket" capabilities. You control types or maps and
> programs that could be created. And again, CAP_SYS_ADMIN guarded.
> Please, don't give CAP_SYS_ADMIN/root permissions to applications you
> can't be sure won't do something stupid and blame kernel API for it.

Right, I didn't mean "blanket" in the sense of "permission to do
anything on the system"; I do get that you can restrict which subset of
functionality you grant. However, *within* that subset, it's a blanket
permission grant. I.e., you can't issue a token that grants a *specific*
application permission to load a *specific* BPF program - you can only
grant a general "load any program" permission that can be used by anyone
who possesses the token.
I guess we could in principle extend the token mechanism to allow this,
but the kernel doesn't seem like the right place to implement such a
fine-grained policy engine...

> After all, the root process can setuid() any file and make it run with
> elevated permissions, right? Doesn't get more "blanket" than that.

Which is exactly why setuid binaries are not generally how we implement
security delegation these days. So I don't think designing a new
mechanism this way is a good idea.

>> So I'm curious whether you considered this as an alternative to
>> BPF_TOKEN? And if so, what your reason was for rejecting it?
>>
>
> Yes, I'm aware, Christian has a follow up short blog post specifically
> for using this for proxying BPF from privileged process ([0]).
>
> So, in short, I think it's not a good generic solution. It's very
> fragile and high-maintenance. It's still proxying BPF UAPI (except
> application does preserve illusion of using BPF syscall, yes, that
> part is good) with all the implications: needing to replicate all of
> UAPI (fetching all those FDs from another process, following all the
> pointers from another process' memory, etc), and also writing back all
> the correct things (into another process' memory): log content,
> log_true_size (out param), any other output parameters.

Right, OK, that bit does sound pretty tedious (although I'll note that
there are people who are trying to make all this generally more
palatable[0]).

However, all that tediousness could be avoided while still retaining the
model of blocking the syscall and asking a userspace policy daemon to
supply a verdict. This could even be done using the same token
mechanism: instead of attaching a permission to the token itself, just
make it an opaque identifier. Then, when a syscall is made that contains
the token, block it and send a notification to user space and use the
verdict that comes back in place of the token "value".
The notification could go through the same file descriptor (using
read/write or an ioctl, restricted to CAP_SYS_ADMIN), or it could be a
separate one that is returned alongside it on TOKEN_CREATE. The
notification could include all of the syscall args or a subset,
depending on the command, but the kernel can ensure there are no TOCTOU
races, and there is no need for the policy daemon to go poking into
another process' namespace.

Actually, using this model I don't think we would even strictly speaking
need the explicit token FD to be included by the calling application
inside the container at all? I.e., if the system policy daemon could
just instruct the kernel "please delegate all permission decisions for
this user namespace to me", it could - so to speak - issue tokens on
demand as each call is made, instead of ahead of time. Which would both
enable the policy daemon to make specific usage decisions and not
require any changes to the applications using BPF inside the container
(not even to include the BPF token FD).

-Toke
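
For illustration only, the opaque-token model described above could be
sketched as a small userspace simulation. Nothing here is a real kernel
interface: Kernel, token_create(), Notification and example_policy() are
hypothetical names made up for this sketch, and the "policy daemon" is
modelled as a plain callback rather than a process reading from a file
descriptor. The only point is the control flow: the token carries no
permissions, the call is held while the daemon is asked, and the daemon
sees a one-time snapshot of the arguments.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple


@dataclass(frozen=True)
class Notification:
    """What the (simulated) kernel hands to the policy daemon."""
    token_id: int            # opaque identifier; carries no permissions itself
    command: str             # e.g. "PROG_LOAD" or "MAP_CREATE"
    args: Tuple[str, ...]    # immutable snapshot of the call arguments


# A policy daemon is modelled as a callable: True means allow, False means deny.
Policy = Callable[[Notification], bool]


class Kernel:
    """Hypothetical stand-in for the kernel side; not a real API."""

    def __init__(self) -> None:
        self._policies: Dict[int, Policy] = {}
        self._next_id = 1

    def token_create(self, policy: Policy) -> int:
        """Register a policy daemon and return an opaque token id."""
        token_id = self._next_id
        self._next_id += 1
        self._policies[token_id] = policy
        return token_id

    def bpf(self, token_id: int, command: str, args: Sequence[str]) -> str:
        """Hold the call, ask the daemon for a verdict, then act on it."""
        policy = self._policies.get(token_id)
        if policy is None:
            raise PermissionError("unknown token")
        # Snapshot the arguments once, so the daemon judges exactly what
        # would be executed (the simulated equivalent of avoiding TOCTOU).
        note = Notification(token_id, command, tuple(args))
        if not policy(note):
            raise PermissionError(f"{command} denied by policy daemon")
        return f"{command} allowed"


def example_policy(note: Notification) -> bool:
    """Allow loading one specific program and nothing else."""
    return note.command == "PROG_LOAD" and note.args == ("trusted-prog",)


if __name__ == "__main__":
    kernel = Kernel()
    token = kernel.token_create(example_policy)
    print(kernel.bpf(token, "PROG_LOAD", ["trusted-prog"]))
    try:
        kernel.bpf(token, "MAP_CREATE", ["hash"])
    except PermissionError as err:
        print(err)
```

In this sketch the per-call decision lives entirely in example_policy(),
so a "specific application may load a specific program" rule is a
userspace matter and the simulated kernel only relays the question and
enforces the answer.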