On Fri, Jul 7, 2023 at 4:34 AM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Wed, Jul 5, 2023 at 6:27 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> >
> > On Thu, Jul 6, 2023 at 4:37 AM Andrii Nakryiko
> > <andrii.nakryiko@xxxxxxxxx> wrote:
> > >
> > > On Fri, Jun 30, 2023 at 7:06 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> > > >
> > > > On Thu, Jun 29, 2023 at 1:18 PM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> > > > >
> > > > > This patch set introduces new BPF object, BPF token, which allows to delegate
> > > > > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > > > > systemd or any other container manager) to a *trusted* unprivileged
> > > > > application. Trust is the key here. This functionality is not about allowing
> > > > > unconditional unprivileged BPF usage. Establishing trust, though, is
> > > > > completely up to the discretion of respective privileged application that
> > > > > would create a BPF token, as different production setups can and do achieve it
> > > > > through a combination of different means (signing, LSM, code reviews, etc),
> > > > > and it's undesirable and infeasible for kernel to enforce any particular way
> > > > > of validating trustworthiness of particular process.
> > > > >
> > > > > The main motivation for BPF token is a desire to enable containerized
> > > > > BPF applications to be used together with user namespaces. This is currently
> > > > > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > > > > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > > > > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> > > > > arbitrary memory, and it's impossible to ensure that they only read memory of
> > > > > processes belonging to any given namespace.
> > > > > This means that it's impossible to
> > > > > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > > > > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > > > > of it to trusted unprivileged applications is such a mechanism. Kernel makes
> > > > > no assumption about what "trusted" constitutes in any particular case, and
> > > > > it's up to specific privileged applications and their surrounding
> > > > > infrastructure to decide that. What kernel provides is a set of APIs to create
> > > > > and tune BPF token, and pass it around to privileged BPF commands that are
> > > > > creating new BPF objects like BPF programs, BPF maps, etc.
> > > > >
> > > > > Previous attempt at addressing this very same problem ([0]) attempted to
> > > > > utilize authoritative LSM approach, but was conclusively rejected by upstream
> > > > > LSM maintainers. BPF token concept is not changing anything about LSM
> > > > > approach, but can be combined with LSM hooks for very fine-grained security
> > > > > policy. Some ideas about making BPF token more convenient to use with LSM (in
> > > > > particular custom BPF LSM programs) were briefly described in recent LSF/MM/BPF
> > > > > 2023 presentation ([1]). E.g., an ability to specify user-provided data
> > > > > (context), which in combination with BPF LSM would allow implementing very
> > > > > dynamic and fine-granular custom security policies on top of BPF token. In the
> > > > > interest of minimizing API surface area discussions this is going to be
> > > > > added in follow up patches, as it's not essential to the fundamental concept
> > > > > of delegatable BPF token.
> > > > >
> > > > > It should be noted that BPF token is conceptually quite similar to the idea of
> > > > > /dev/bpf device file, proposed by Song a while ago ([2]).
> > > > > The biggest
> > > > > difference is the idea of using virtual anon_inode file to hold BPF token and
> > > > > allowing multiple independent instances of them, each with its own set of
> > > > > restrictions. BPF pinning solves the problem of exposing such BPF token
> > > > > through file system (BPF FS, in this case) for cases where transferring FDs
> > > > > over Unix domain sockets is not convenient. And also, crucially, BPF token
> > > > > approach is not using any special stateful task-scoped flags. Instead, bpf()
> > > > > syscall accepts token_fd parameters explicitly for each relevant BPF command.
> > > > > This addresses main concerns brought up during the /dev/bpf discussion, and
> > > > > fits better with overall BPF subsystem design.
> > > > >
> > > > > This patch set adds a basic minimum of functionality to make BPF token useful
> > > > > and to discuss API and functionality. Currently only low-level libbpf APIs
> > > > > support passing BPF token around, allowing to test kernel functionality, but
> > > > > for the most part is not sufficient for real-world applications, which
> > > > > typically use high-level libbpf APIs based on `struct bpf_object` type. This
> > > > > was done with the intent to limit the size of patch set and concentrate on
> > > > > mostly kernel-side changes. All the necessary plumbing for libbpf will be sent
> > > > > as a separate follow up patch set once kernel support makes it upstream.
> > > > >
> > > > > Another part that should happen once kernel-side BPF token is established, is
> > > > > a set of conventions between applications (e.g., systemd), tools (e.g.,
> > > > > bpftool), and libraries (e.g., libbpf) about sharing BPF tokens through BPF FS
> > > > > at well-defined locations to allow applications to take advantage of this in
> > > > > automatic fashion without explicit code changes on BPF application's side.
> > > > > But I'd like to postpone this discussion to after BPF token concept lands.
> > > > >
> > > > > One important distinction from v2 that should be noted is a change in the
> > > > > semantics of a newly added BPF_TOKEN_CREATE command. Previously,
> > > > > BPF_TOKEN_CREATE would create BPF token kernel object and return its FD to
> > > > > user-space, allowing to (optionally) pin it in BPF FS using BPF_OBJ_PIN
> > > > > command. This v3 version changes this slightly: BPF_TOKEN_CREATE combines BPF
> > > > > token object creation *and* pinning in BPF FS. Such change ensures that BPF
> > > > > token is always associated with a specific instance of BPF FS and cannot
> > > > > "escape" it by application re-pinning it somewhere else using another
> > > > > BPF_OBJ_PIN call. Now, BPF token can only be pinned once during its creation,
> > > > > better containing it inside intended container (under assumption BPF FS is set
> > > > > up in such a way as to not be shared with other containers on the system).
> > > > >
> > > > >   [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@xxxxxxxxxx/
> > > > >   [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf
> > > > >   [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@xxxxxx/
> > > > >
> > > > > v3->v3-resend:
> > > > >   - I started integrating token_fd into bpf_object_open_opts and higher-level
> > > > >     libbpf bpf_object APIs, but it started going a bit deeper into bpf_object
> > > > >     implementation details and how libbpf performs feature detection and
> > > > >     caching, so I decided to keep it separate from this patch set and not
> > > > >     distract from the mostly kernel-side changes;
> > > > > v2->v3:
> > > > >   - make BPF_TOKEN_CREATE pin created BPF token in BPF FS, and disallow
> > > > >     BPF_OBJ_PIN for BPF token;
> > > > > v1->v2:
> > > > >   - fix build failures on Kconfig with CONFIG_BPF_SYSCALL unset;
> > > > >   - drop BPF_F_TOKEN_UNKNOWN_* flags and simplify UAPI (Stanislav).
> > > > >
> > > > > Andrii Nakryiko (14):
> > > > >   bpf: introduce BPF token object
> > > > >   libbpf: add bpf_token_create() API
> > > > >   selftests/bpf: add BPF_TOKEN_CREATE test
> > > > >   bpf: add BPF token support to BPF_MAP_CREATE command
> > > > >   libbpf: add BPF token support to bpf_map_create() API
> > > > >   selftests/bpf: add BPF token-enabled test for BPF_MAP_CREATE command
> > > > >   bpf: add BPF token support to BPF_BTF_LOAD command
> > > > >   libbpf: add BPF token support to bpf_btf_load() API
> > > > >   selftests/bpf: add BPF token-enabled BPF_BTF_LOAD selftest
> > > > >   bpf: add BPF token support to BPF_PROG_LOAD command
> > > > >   bpf: take into account BPF token when fetching helper protos
> > > > >   bpf: consistenly use BPF token throughout BPF verifier logic
> > > > >   libbpf: add BPF token support to bpf_prog_load() API
> > > > >   selftests/bpf: add BPF token-enabled BPF_PROG_LOAD tests
> > > > >
> > > > >  drivers/media/rc/bpf-lirc.c                   |   2 +-
> > > > >  include/linux/bpf.h                           |  79 ++++-
> > > > >  include/linux/filter.h                        |   2 +-
> > > > >  include/uapi/linux/bpf.h                      |  53 ++++
> > > > >  kernel/bpf/Makefile                           |   2 +-
> > > > >  kernel/bpf/arraymap.c                         |   2 +-
> > > > >  kernel/bpf/cgroup.c                           |   6 +-
> > > > >  kernel/bpf/core.c                             |   3 +-
> > > > >  kernel/bpf/helpers.c                          |   6 +-
> > > > >  kernel/bpf/inode.c                            |  46 ++-
> > > > >  kernel/bpf/syscall.c                          | 183 +++++++++---
> > > > >  kernel/bpf/token.c                            | 201 +++++++++++++
> > > > >  kernel/bpf/verifier.c                         |  13 +-
> > > > >  kernel/trace/bpf_trace.c                      |   2 +-
> > > > >  net/core/filter.c                             |  36 +--
> > > > >  net/ipv4/bpf_tcp_ca.c                         |   2 +-
> > > > >  net/netfilter/nf_bpf_link.c                   |   2 +-
> > > > >  tools/include/uapi/linux/bpf.h                |  53 ++++
> > > > >  tools/lib/bpf/bpf.c                           |  35 ++-
> > > > >  tools/lib/bpf/bpf.h                           |  45 ++-
> > > > >  tools/lib/bpf/libbpf.map                      |   1 +
> > > > >  .../selftests/bpf/prog_tests/libbpf_probes.c  |   4 +
> > > > >  .../selftests/bpf/prog_tests/libbpf_str.c     |   6 +
> > > > >  .../testing/selftests/bpf/prog_tests/token.c  | 277 ++++++++++++++++++
> > > > >  24 files changed, 957 insertions(+), 104 deletions(-)
> > > > >  create mode 100644 kernel/bpf/token.c
> > > > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/token.c
> > > > >
> > > > > --
> > > > > 2.34.1
> > > > >
> > > > >
> > > >
> > > >
> > > > Hi Andrii,
> > > >
> > > > Thanks for your proposal.
> > > > That seems to be a useful functionality, and yet I have some questions.
> > >
> > > I've answered them below. But I don't think either of them have any
> > > relation to BPF token and the problem I'm trying to solve.
> > >
> > > >
> > > > 1. Why can't we add security_bpf_probe_read_{kernel,user}?
> > > >    If possible, we can use these LSM hooks to refuse the process to
> > > > read other tasks' information. E.g. if the other process is not within
> > > > the same cgroup or the same namespace, we just refuse the reading. I
> > > > think it is not hard to identify if the other process is within the
> > > > same cgroup or the same namespace.
> > >
> > > There are probably many reasons. First, performance-wise, LSM hook for
> > > each bpf_probe_read_{kernel,user}() call will be prohibitive. And just
> > > in general, one would need to be very careful with such LSM hooks,
> > > because bpf_probe_read_{kernel,user}() often happens from NMI context,
> > > and LSM policy would have to be written and validated very carefully
> > > with NMI context in mind.
> > >
> > > But, more conceptually, for probe_read you get a random address and
> > > you know the process context you are running in (but you might be
> > > actually running in softirq and NMI, and that process context is
> > > irrelevant). How can you efficiently (or at all) tell if that random
> > > address "belongs" to cgroup or namespace? Just at conceptual level?
> > >
> > > >
> > > > 2. Why can't we extend bpf_cookie?
> > > >    We're now using bpf_cookie to identify each user or each
> > > > application, and only the permitted cookies can create new probe
> > > > links.
> > > > However we find the bpf_cookie is only supported by tracing,
> > > > perf_event and kprobe_multi, so we're planning to extend it to other
> > > > possible link types, then we can use LSM hooks to control all bpf
> > > > links. I think that the upstream kernel should also support
> > > > bpf_cookie for all bpf links. If possible, we will post it to the
> > > > upstream in the future.
> > > >    After I have read your BPF token proposal, I just have some other
> > > > ideas. Why can't we just extend bpf_cookie to all other BPF objects?
> > > > For example, all progs and maps should also have the bpf_cookie.
> > > >
> > >
> > > I'm not exactly clear how you use BPF cookie, but it wasn't intended
> > > to provide any sort of security or validation policy. It's purely a
> > > user-provided u64 to help distinguish different attach points when the
> > > same BPF program is attached in multiple places (e.g., kprobe tracing
> > > many different kernel functions and needing to distinguish between
> > > them at runtime).
> >
> > In our container environment, we enable the CAP_BPF, CAP_PERFMON and
> > CAP_NET_ADMIN for the containers which want to run BPF programs
> > inside. However we don't want them to run whatever BPF programs they
> > want. We only allow them to run the BPF programs we have permitted for
> > each of them. So we are using LSM to audit the BPF behavior such as
> > prog load, map creation and link attach. We define different BPF
> > policies for different containers. In order to identify different
> > containers efficiently, we assign different bpf_cookies for different
> > containers. bpf_cookie is a u64, that's enough for our use cases.
>
> I can see how you can use BPF cookies for this, but it's certainly not
> an intended use case :) BPF cookie is most useful on BPF side of
> things.
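
For reference, the scheme quoted above (one u64 cookie per container,
consulted when auditing prog load, map creation and link attach) can be
modeled roughly as below. This is only a user-space sketch of the lookup
logic under stated assumptions: struct container_policy, enum bpf_op,
policy_allows() and the cookie values are all hypothetical names made up
for illustration, and none of them are kernel, LSM or libbpf APIs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operation classes that a policy could audit. */
enum bpf_op {
	OP_PROG_LOAD   = 1u << 0,
	OP_MAP_CREATE  = 1u << 1,
	OP_LINK_ATTACH = 1u << 2,
};

/* Hypothetical per-container policy, keyed by a u64 cookie. */
struct container_policy {
	uint64_t cookie;      /* identifier assigned to one container */
	unsigned allowed_ops; /* bitmask of enum bpf_op values */
};

/* Example policy table; the cookie values are arbitrary. */
static const struct container_policy policies[] = {
	{ 0x1001ULL, OP_PROG_LOAD | OP_MAP_CREATE | OP_LINK_ATTACH },
	{ 0x1002ULL, OP_MAP_CREATE },
};

/* Return true only if the cookie is known and permits the operation. */
static bool policy_allows(uint64_t cookie, enum bpf_op op)
{
	for (size_t i = 0; i < sizeof(policies) / sizeof(policies[0]); i++) {
		if (policies[i].cookie == cookie)
			return (policies[i].allowed_ops & op) != 0;
	}
	return false; /* unknown cookie: deny by default */
}
```

In this model a container with cookie 0x1002 may create maps but not load
programs, and an unknown cookie is denied everything. How the cookie
actually reaches an LSM hook is outside the scope of the sketch.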
Using the bpf_cookie as an app ID has proven valuable in our use case,
so we continue to rely on it :)

>
> But what you are describing is meant to be doable with BPF token. It's
> not in first patch set, but I intended to allow user to specify an
> extra "user context" blob of bytes which would be stored with BPF
> token. And this data should be accessible from BPF LSM programs to
> make extra custom policy decisions. But we need to agree on initial
> BPF token stuff first, and then build out all the rest.

Sounds good. Supporting user context within the BPF token would make it
considerably more useful for custom policy decisions like ours.

--
Regards
Yafang