From: YiFei Zhu <yifeifz2@xxxxxxxxxxxx>

Based on: https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html

This patchset enables seccomp filters to be written in eBPF.
Supporting eBPF filters has been proposed a few times in the past.
The main concerns were (1) use cases and (2) security. We have
identified many use cases that can benefit from advanced eBPF
filters, such as:

  * exec-only-once filter / apply filter after exec
  * syscall logging (e.g. via maps)
  * expressiveness & better tooling (no need for DSLs like easyseccomp)
  * contained syscall fault injection
  * Temporal System Call Specialization [1] with restrictive
    initialization phases (serving phase syscalls are filtered)
  * possible future extensions such as syscall serialization and
    argument rewriting

These features can also be achieved with user notifier + ptrace, but
user notifier incurs a lot of context switches (see the benchmark
results below) and is hence much less efficient than eBPF.

For security, for an unprivileged caller, our implementation is as
restrictive as user notifier + ptrace with regard to capabilities.
eBPF helpers follow the privilege model of the original eBPF helpers.
Advanced eBPF features (maps and helpers) are restricted by a new LSM
hook, seccomp_extended. If the LSM permits these features, then all
standard bpf helpers are permitted, and tracing helpers are permitted
too if the loader is bpf_capable and perfmon_capable.

Mutable privileges should not be a concern: if seccomp-eBPF is used
to implement a mutable policy of privileges, such a policy can be
implemented using user notifier anyhow (which does not require
seccomp-eBPF).

Moreover, a mechanism for reading user memory is added. The same
prototypes of bpf_probe_read_user{,str} from tracing are used.
However, when the loader of the bpf program does not have
CAP_SYS_PTRACE, the helper will return -EPERM if the task under the
seccomp filter is non-dumpable.
The reason for this is that if we perform a reduction from
seccomp-eBPF to user notifier + ptrace, ptrace requires
CAP_SYS_PTRACE to read from a non-dumpable process. However, eBPF
does not solve the TOCTOU problem of user notifier, so users should
not use this to enforce a policy based on memory contents.

In addition, a mechanism for storing process states between filter
runs is added. This uses the BPF-LSM task storage. However, since
unprivileged bpf loaders do not have access to ptr to BTF ID for use
as the task parameter to the helpers, the workaround is to use NULL
as the parameter, and the helper will fall back to current's group
leader. This is insufficient, unfortunately, because of the BTF
enforcement in bpf_local_storage_map_alloc_check, and the fact that
tasks without bpf_capable cannot load map BTFs. (Can I ask why this
is restricted this way?)

Giuseppe Scrivano shows how to support eBPF filters in crun [2],
based on which we have tested a number of stateful filters.

Performance-wise, Jinghao ran a test of 1,000,000 getpid() calls on
an Intel i7-9700K with the stock Ubuntu config. Half of the syscalls
return EPERM and half pass through to the getpid() syscall handler
[3]. Each number is the median of 10 runs:

                user notif      eBPF          ratio
  QEMU          6808104 us      80508.5 us    84.6
  Bare Metal    3403667.5 us    80316 us      42.4

[1] https://www.usenix.org/conference/usenixsecurity20/presentation/ghavamnia
[2] https://github.com/giuseppe/crun/commit/3906b4fbcb671f8f188deef08c94ceae86a80120
[3] https://github.com/xlab-uiuc/seccomp-ebpf-upstream/tree/perf-test

Patch 1 moves the no_new_privs check in filter loading.
Patch 2 implements basic support for seccomp-eBPF in the kernel.
Patch 3 enables a ptracer to get an fd to the eBPF filter, for CRIU.
Patch 4 enables libbpf to recognize the section "seccomp".
Patch 5 adds a sample program test_seccomp to samples/bpf.
Patch 6 adds an LSM hook seccomp_extended.
Patch 7 allows bpf verifier hooks to restrict direct map access.
Patch 8 implements restrictions for eBPF filters depending on LSM hooks.
Patch 9 lets Yama LSM restrict seccomp-ebpf based on ptrace_scope.
Patch 10 enables seccomp-ebpf to read user memory.
Patch 11 allows bpf helpers to have nullable ptr to BTF ID as argument.
Patch 12 implements process storage using BPF-LSM task storage.

Sargun Dhillon (3):
  bpf, seccomp: Add eBPF filter capabilities
  seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp
    filters
  samples/bpf: Add eBPF seccomp sample programs

YiFei Zhu (9):
  seccomp: Move no_new_privs check to after prepare_filter
  libbpf: recognize section "seccomp"
  lsm: New hook seccomp_extended
  bpf/verifier: allow restricting direct map access
  seccomp-ebpf: restrict filter to almost cBPF if LSM request such
  yama: (concept) restrict seccomp-eBPF with ptrace_scope
  seccomp-ebpf: Add ability to read user memory
  bpf/verifier: support NULL-able ptr to BTF ID as helper argument
  seccomp-ebpf: support task storage from BPF-LSM, defaulting to group
    leader

 arch/Kconfig                    |   7 +
 include/linux/bpf.h             |   8 ++
 include/linux/bpf_types.h       |   4 +
 include/linux/lsm_hook_defs.h   |   4 +
 include/linux/seccomp.h         |  15 +-
 include/linux/security.h        |  13 ++
 include/uapi/linux/bpf.h        |   1 +
 include/uapi/linux/ptrace.h     |   2 +
 include/uapi/linux/seccomp.h    |   1 +
 kernel/bpf/bpf_task_storage.c   |  64 +++++++--
 kernel/bpf/syscall.c            |   1 +
 kernel/bpf/verifier.c           |  15 +-
 kernel/ptrace.c                 |   4 +
 kernel/seccomp.c                | 235 ++++++++++++++++++++++++++++----
 kernel/trace/bpf_trace.c        |  42 ++++++
 samples/bpf/Makefile            |   3 +
 samples/bpf/test_seccomp_kern.c |  41 ++++++
 samples/bpf/test_seccomp_user.c |  49 +++++++
 security/security.c             |   8 ++
 security/yama/yama_lsm.c        |  30 ++++
 tools/include/uapi/linux/bpf.h  |   1 +
 tools/lib/bpf/libbpf.c          |   1 +
 22 files changed, 511 insertions(+), 38 deletions(-)
 create mode 100644 samples/bpf/test_seccomp_kern.c
 create mode 100644 samples/bpf/test_seccomp_user.c

-- 
2.31.1
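
For readers who want the user-memory read rule from the cover letter
spelled out, here is a minimal userspace model of it. This is only a
sketch of the stated rule (the read fails with -EPERM when the loader
lacks the ptrace capability and the filtered task is non-dumpable); it
is not the code from the patches. The function name and both parameter
names are hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Userspace model of the rule described in the cover letter.
 * NOT kernel code: seccomp_user_read_allowed() and its parameters are
 * hypothetical names. Returns 0 if the read may proceed, -EPERM if not.
 */
int seccomp_user_read_allowed(bool loader_has_cap_sys_ptrace,
                              bool task_dumpable)
{
	/* A loader with the ptrace capability may read any filtered task. */
	if (loader_has_cap_sys_ptrace)
		return 0;

	/* Without it, a non-dumpable task must not be read. */
	if (!task_dumpable)
		return -EPERM;

	return 0;
}

/* Walks the four combinations of the two inputs. */
void seccomp_user_read_self_check(void)
{
	assert(seccomp_user_read_allowed(true, true) == 0);
	assert(seccomp_user_read_allowed(true, false) == 0);
	assert(seccomp_user_read_allowed(false, true) == 0);
	assert(seccomp_user_read_allowed(false, false) == -EPERM);
}
```

Only one of the four combinations is denied, which mirrors the
reduction argument: ptrace itself would need the capability to read
from a non-dumpable process.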