On Thu, Jun 8, 2023 at 11:49 AM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
>
> On 06/07, Andrii Nakryiko wrote:
> > This patch set introduces new BPF object, BPF token, which allows to delegate
> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > systemd or any other container manager) to a *trusted* unprivileged
> > application. Trust is the key here. This functionality is not about allowing
> > unconditional unprivileged BPF usage. Establishing trust, though, is
> > completely up to the discretion of respective privileged application that
> > would create a BPF token.
> >
> > The main motivation for BPF token is a desire to enable containerized
> > BPF applications to be used together with user namespaces. This is currently
> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user(), can safely read
> > arbitrary memory, and it's impossible to ensure that they only read memory of
> > processes belonging to any given namespace. This means that it's impossible to
> > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > of it to trusted unprivileged applications is such a mechanism. Kernel makes
> > no assumption about what "trusted" constitutes in any particular case, and
> > it's up to specific privileged applications and their surrounding
> > infrastructure to decide that. What kernel provides is a set of APIs to create
> > and tune BPF token, and pass it around to privileged BPF commands that are
> > creating new BPF objects like BPF programs, BPF maps, etc.
> >
> > Previous attempt at addressing this very same problem ([0]) attempted to
> > utilize authoritative LSM approach, but was conclusively rejected by upstream
> > LSM maintainers.
> > BPF token concept is not changing anything about LSM
> > approach, but can be combined with LSM hooks for very fine-grained security
> > policy. Some ideas about making BPF token more convenient to use with LSM (in
> > particular custom BPF LSM programs) were briefly described in recent LSF/MM/BPF
> > 2023 presentation ([1]). E.g., an ability to specify user-provided data
> > (context), which in combination with BPF LSM would allow implementing
> > very dynamic and fine-grained custom security policies on top of BPF token. In the
> > interest of minimizing API surface area discussions this is going to be
> > added in follow up patches, as it's not essential to the fundamental concept
> > of delegatable BPF token.
> >
> > It should be noted that BPF token is conceptually quite similar to the idea of
> > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> > difference is the idea of using virtual anon_inode file to hold BPF token and
> > allowing multiple independent instances of them, each with its own set of
> > restrictions. BPF pinning solves the problem of exposing such BPF token
> > through file system (BPF FS, in this case) for cases where transferring FDs
> > over Unix domain sockets is not convenient. And also, crucially, BPF token
> > approach is not using any special stateful task-scoped flags. Instead, bpf()
> > syscall accepts token_fd parameters explicitly for each relevant BPF command.
> > This addresses main concerns brought up during the /dev/bpf discussion, and
> > fits better with overall BPF subsystem design.
> >
> > This patch set adds a basic minimum of functionality to make BPF token useful
> > and to discuss API and functionality. Currently only low-level libbpf APIs
> > support passing BPF token around, allowing to test kernel functionality, but
> > for the most part is not sufficient for real-world applications, which
> > typically use high-level libbpf APIs based on `struct bpf_object` type.
> > This was done with the intent to limit the size of patch set and concentrate on
> > mostly kernel-side changes. All the necessary plumbing for libbpf will be sent
> > as a separate follow up patch set once kernel support makes it upstream.
> >
> > Another part that should happen once kernel-side BPF token is established, is
> > a set of conventions between applications (e.g., systemd), tools (e.g.,
> > bpftool), and libraries (e.g., libbpf) about sharing BPF tokens through BPF FS
> > at well-defined locations to allow applications to take advantage of this in
> > automatic fashion without explicit code changes on BPF application's side.
> > But I'd like to postpone this discussion to after BPF token concept lands.
> >
> > [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@xxxxxxxxxx/
> > [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf
> > [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@xxxxxx/
> >
> > v1->v2:
> > - fix build failures on Kconfig with CONFIG_BPF_SYSCALL unset;
> > - drop BPF_F_TOKEN_UNKNOWN_* flags and simplify UAPI (Stanislav).
>
> I went through v2, everything makes sense, the only thing that is
> slightly confusing to me is the bpf_token_capable() call.
> The name somehow implies that the token is capable of something
> where in reality the function does "return token || capable(x)".

heh, "bpf_token_" part is sort of like namespace/object prefix. The
intent here was to have a token-aware capable check. And yes, if we
get a token during prog/map/etc construction, the assumption is that
it provides all relevant permissions.

>
> IMO, it would be less confusing if we do something like the following,
> explicitly, instead of calling a function:
>
> if (token || {bpf_,perfmon_,}capable(x)) ...
>
> (or rename to something like bpf_token_or_capable(x))

I'd rather not open-code `if (token || ...)` checks everywhere, but I
can rename to `bpf_token_or_capable()` if people prefer.
I erred on the side of succinctness, but if it's confusing, then best to rename?

>
> Up to you on whether to take any action on that. OTOH, once you
> grasp what bpf_token_capable really does, it's not really a problem.

Cool, thanks for taking a look!
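
For readers following along, the token-aware check discussed above ("return token || capable(x)") can be sketched roughly as below. This is a user-space mock for illustration only, not the code from the patch set: `struct bpf_token` is an empty stand-in, `mock_capable()` and `mock_caps` are hypothetical replacements for the kernel's capable(), and the `MOCK_CAP_*` values are assumed to mirror CAP_PERFMON and CAP_BPF. The helper uses the `bpf_token_or_capable()` name floated in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's token object; contents don't matter here. */
struct bpf_token { int unused; };

/* Assumed to mirror CAP_PERFMON / CAP_BPF from <linux/capability.h>. */
enum { MOCK_CAP_PERFMON = 38, MOCK_CAP_BPF = 39 };

/* Mock of capable(): the caller "holds" whichever bits are set in this mask. */
static unsigned long long mock_caps;

static bool mock_capable(int cap)
{
	return (mock_caps >> cap) & 1ULL;
}

/*
 * Token-aware check: a non-NULL token is assumed to grant the permission;
 * without a token, fall back to the plain capability check.
 */
static bool bpf_token_or_capable(const struct bpf_token *token, int cap)
{
	return token != NULL || mock_capable(cap);
}

static void example_usage(void)
{
	struct bpf_token tok = { 0 };

	mock_caps = 0;                      /* unprivileged caller */
	assert(!bpf_token_or_capable(NULL, MOCK_CAP_BPF));
	assert(bpf_token_or_capable(&tok, MOCK_CAP_BPF));

	mock_caps = 1ULL << MOCK_CAP_BPF;   /* caller with CAP_BPF only */
	assert(bpf_token_or_capable(NULL, MOCK_CAP_BPF));
	assert(!bpf_token_or_capable(NULL, MOCK_CAP_PERFMON));
}
```

Keeping the check in one helper avoids repeating `if (token || ...)` at every call site, which is the trade-off raised in the reply; the rename only makes the "or" explicit in the name.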