On Wed, 05 Jul 2023 01:20:22 +0200
Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:

> Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes:
>
> > On Thu, Jun 29, 2023 at 4:15 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> >>
> >> Andrii Nakryiko <andrii@xxxxxxxxxx> writes:
> >>
> >> > This patch set introduces new BPF object, BPF token, which allows to delegate
> >> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> >> > systemd or any other container manager) to a *trusted* unprivileged
> >> > application. Trust is the key here. This functionality is not about allowing
> >> > unconditional unprivileged BPF usage. Establishing trust, though, is
> >> > completely up to the discretion of respective privileged application that
> >> > would create a BPF token, as different production setups can and do achieve it
> >> > through a combination of different means (signing, LSM, code reviews, etc),
> >> > and it's undesirable and infeasible for kernel to enforce any particular way
> >> > of validating trustworthiness of particular process.
> >> >
> >> > The main motivation for BPF token is a desire to enable containerized
> >> > BPF applications to be used together with user namespaces. This is currently
> >> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> >> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> >> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> >> > arbitrary memory, and it's impossible to ensure that they only read memory of
> >> > processes belonging to any given namespace. This means that it's impossible to
> >> > have namespace-aware CAP_BPF capability, and as such another mechanism to
> >> > allow safe usage of BPF functionality is necessary. BPF token and delegation
> >> > of it to a trusted unprivileged applications is such mechanism.
> >> > Kernel makes
> >> > no assumption about what "trusted" constitutes in any particular case, and
> >> > it's up to specific privileged applications and their surrounding
> >> > infrastructure to decide that. What kernel provides is a set of APIs to create
> >> > and tune BPF token, and pass it around to privileged BPF commands that are
> >> > creating new BPF objects like BPF programs, BPF maps, etc.
> >>
> >> So a colleague pointed out today that the Seccomp Notify functionality
> >> would be a way to achieve your stated goal of allowing unprivileged
> >> containers to (selectively) perform bpf() syscall operations. Christian
> >> Brauner has a pretty nice writeup of the functionality here:
> >> https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development
> >>
> >> In fact he even mentions allowing unprivileged access to bpf() as a
> >> possible use case (in the second-to-last paragraph).
> >>
> >> AFAICT this would enable your use case without adding any new kernel
> >> functionality or changing the BPF-using applications, while allowing the
> >> privileged userspace daemon to make case-by-case decisions on each
> >> operation instead of granting blanket capabilities (which is my main
> >> objection to the token proposal, as we discussed on the last iteration
> >> of the series).
> >
> > It's not "blanket" capabilities. You control types or maps and
> > programs that could be created. And again, CAP_SYS_ADMIN guarded.
> > Please, don't give CAP_SYS_ADMIN/root permissions to applications you
> > can't be sure won't do something stupid and blame kernel API for it.
>
> Right, I didn't mean "blanket" in the sense of "permission to do
> anything on the system"; I do get that you can restrict which subset of
> functionality you grant. However, *within* that subset, it's a blanket
> permission grant.
> I.e., you can't issue a token that grants a *specific*
> application permission to load a *specific* BPF program - you can only
> grant a general "load any program" permission that can be used by anyone
> who possesses the token.
>
> I guess we could in principle extend the token mechanism to allow this,
> but the kernel doesn't seem like the right place to implement such a
> fine-grained policy engine...
>
> > After all, the root process can setuid() any file and make it run with
> > elevated permissions, right? Doesn't get more "blanket" than that.
>
> Which is exactly why setuid binaries are not generally how we implement
> security delegation these days. So I don't think designing a new
> mechanism this way is a good idea.
>
> >> So I'm curious whether you considered this as an alternative to
> >> BPF_TOKEN? And if so, what your reason was for rejecting it?
> >>
> >
> > Yes, I'm aware, Christian has a follow up short blog post specifically
> > for using this for proxying BPF from privileged process ([0]).
> >
> > So, in short, I think it's not a good generic solution. It's very
> > fragile and high-maintenance. It's still proxying BPF UAPI (except
> > application does preserve illusion of using BPF syscall, yes, that
> > part is good) with all the implications: needing to replicate all of
> > UAPI (fetching all those FDs from another process, following all the
> > pointers from another process' memory, etc), and also writing back all
> > the correct things (into another process' memory): log content,
> > log_true_size (out param), any other output parameters.
>
> Right, OK, that bit does sound pretty tedious (although I'll note that
> there are people who are trying to make all this generally more
> palatable[0]).

[0] https://seitan.rocks/ :)

Some clickbaiting for Christian: the presentation we gave a couple of
weeks ago, also linked from the project website, actually credits you
(slide 29/30, of course).
The code is still very much draft quality (we mostly focused on
demos/feasibility so far, cleaning it up now), and we didn't prove (at
least not yet) that handling complicated stuff such as bpf(2) is
actually convenient, but that's at least in scope as a stretch goal.
I'm not claiming it's doable, but we'd give it a try. What we have at
the moment is a meagre set of eight syscall models, some blatantly
incomplete.

A couple of comments to specific points Christian mentioned:

On Tue, 4 Jul 2023 11:38:38 +0200
Christian Brauner <brauner@xxxxxxxxxx> wrote:

> It's a pipe dream that you can transparently proxy system calls for
> another process via seccomp for sufficiently complex system calls. We
> did it for specific use-cases where we could sufficiently guarantee that
> they could be safe.

Right, so we're trying to pick it up from there. It's way too early to
claim success, but I thought it would make sense to chime in anyway.

> But to make this work it would involve way more invasive changes:
>
> * nesting/stacking of seccomp notifiers

The need for stacked seccomp filters is obvious to me and that works
more or less naturally. But why would you actually need to stack, or
especially nest *notifiers* themselves?

> * clean handling of pointer arguments in-kernel such that you can safely
>   continue system calls being sure that they haven't been modified. This
>   is currently only possible in scenarios where safety is guaranteed by
>   the kernel refusing nonsensical or unsafe arguments

We're considering a couple of options. One is to never use
SECCOMP_USER_NOTIF_FLAG_CONTINUE for system calls accepting pointers, or
only allowing that as an explicit "unsafe" option. For a "safe"
implementation, the supervisor (seitan) would in any case replay the
system call, matching the context (namespaces, credentials) of the
target process.
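
To make the first option a bit more concrete, here is a rough sketch of
the decision the supervisor would take per notification. This is not
seitan code: SyscallModel, Action, decide() and response_flags() are
hypothetical names made up for illustration, and it assumes that calls
without pointer arguments may simply be continued. Only the value of
SECCOMP_USER_NOTIF_FLAG_CONTINUE (bit 0) is taken from <linux/seccomp.h>.

```python
from dataclasses import dataclass
from enum import Enum

# Same value as in <linux/seccomp.h> (1UL << 0).
SECCOMP_USER_NOTIF_FLAG_CONTINUE = 1 << 0


class Action(Enum):
    CONTINUE = "continue"  # let the kernel execute the original call
    REPLAY = "replay"      # supervisor re-issues the call on its own


@dataclass(frozen=True)
class SyscallModel:
    """Hypothetical, heavily simplified stand-in for a syscall model."""
    name: str
    has_pointer_args: bool


def decide(model: SyscallModel, allow_unsafe: bool = False) -> Action:
    """Pick an action for one notification.

    Assumption: without pointer arguments there is no memory another
    thread could rewrite after the check, so continuing is acceptable.
    With pointer arguments, continue only if explicitly marked unsafe.
    """
    if not model.has_pointer_args or allow_unsafe:
        return Action.CONTINUE
    return Action.REPLAY


def response_flags(action: Action) -> int:
    """Flags the supervisor would set in its notification response."""
    if action is Action.CONTINUE:
        return SECCOMP_USER_NOTIF_FLAG_CONTINUE
    return 0
```

With this, decide(SyscallModel("bpf", True)) picks the replay path, and
the supervisor would have to perform the call itself in the target's
context rather than setting the continue flag.
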
If PID or TID (per se, not in terms of associated
context/capabilities) of the caller matter for a specific system call,
though, we simply can't support that. But that shouldn't actually be
relevant for bpf(2).

Strictly speaking, I think it's actually possible to "fix" this in the
kernel by means of checking or copying memory that's addressable by a
thread, but that might prove too invasive or end up in insurmountable
layering violations. This mechanism would involve "control" paths rather
than data paths, though, so the performance impact is not really
worrying.

Another option, which we outlined at this very convenient link:

  https://github.com/alicefr/community/blob/seitan/design-proposals/seitan/security-aspects-seitan.md#if-i-use-the-json-model-as-a-security-filter-can-another-thread-in-the-same-process-context-write-to-the-memory-area-pointed-to-by-system-call-arguments-while-the-calling-thread-is-blocked-and-defy-the-purpose-of-the-filter

would be to make the supervisor perform a deep copy (system calls are
anyway modeled in the seitan-cooker component) and then use good old
ptrace(2) as needed.

> * correct privilege handling
>   The seccomp notifier emulates system calls in userspace and thus has
>   to mimick the privilege context of the task it is emulating the system
>   call for in such a way that (i) it allows it to succeed by avoiding the
>   privilege limitations of why the given system call was supposed to be
>   proxied in the first place, (ii) it doesn't allow to circumvent other,
>   generic restrictions that would otherwise cause the system call to
>   fail. It's like saying e.g., "execute with most of the proxied task's
>   creds but let it have a few more privileges". That's frail as Linux
>   creds aren't really composable. That's why we have override_creds()
>   not "add_creds()" and "subtract_creds()" which would probably be
>   nicer.

Right, at the moment we just run that as root, but we plan to take care
of (ii) (albeit not solving it entirely, I guess), by at least applying
a seccomp filter to the supervisor itself. As to the set of (composed?)
capabilities, we don't have an answer yet.

> Or it would have to be a generic first class kernel proxy which begs the
> question why not change the subsystems itself to do this cleanly.

Well, the fine-grained "policy" implementation we're trying to achieve
looks to me like something that's a bit too complicated for the kernel,
and really more appropriate for userspace.

-- 
Stefano