Re: [PATCH v8 bpf-next 00/18] BPF token and BPF FS-based delegation

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Tue, 24 Oct 2023 10:52:09 -0700

On Mon, Oct 16, 2023 at 11:04 AM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
>
> This patch set introduces an ability to delegate a subset of BPF subsystem
> functionality from privileged system-wide daemon (e.g., systemd or any other
> container manager) through special mount options for userns-bound BPF FS to
> a *trusted* unprivileged application. Trust is the key here. This
> functionality is not about allowing unconditional unprivileged BPF usage.
> Establishing trust, though, is completely up to the discretion of respective
> privileged application that would create and mount a BPF FS instance with
> delegation enabled, as different production setups can and do achieve it
> through a combination of different means (signing, LSM, code reviews, etc),
> and it's undesirable and infeasible for kernel to enforce any particular way
> of validating trustworthiness of particular process.
>
> The main motivation for this work is a desire to enable containerized BPF
> applications to be used together with user namespaces. This is currently
> impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> arbitrary memory, and it's impossible to ensure that they only read memory of
> processes belonging to any given namespace. This means that it's impossible to
> have a mechanically verifiable namespace-aware CAP_BPF capability, and as such
> another mechanism to allow safe usage of BPF functionality is necessary.BPF FS
> delegation mount options and BPF token derived from such BPF FS instance is
> such a mechanism. Kernel makes no assumption about what "trusted" constitutes
> in any particular case, and it's up to specific privileged applications and
> their surrounding infrastructure to decide that. What kernel provides is a set
> of APIs to setup and mount special BPF FS instanecs and derive BPF tokens from
> it. BPF FS and BPF token are both bound to its owning userns and in such a way
> are constrained inside intended container. Users can then pass BPF token FD to
> privileged bpf() syscall commands, like BPF map creation and BPF program
> loading, to perform such operations without having init userns privileged.
>
> This version incorporates feedback and suggestions ([3]) received on v3 of
> this patch set, and instead of allowing to create BPF tokens directly assuming
> capable(CAP_SYS_ADMIN), we instead enhance BPF FS to accepts a few new
> delegation mount options. If these options are used and BPF FS itself is
> properly created, set up, and mounted inside the user namespaced container,
> user application is able to derive a BPF token object from BPF FS instance,
> and pass that token to bpf() syscall. As explained in patch #2, BPF token
> itself doesn't grant access to BPF functionality, but instead allows kernel to
> do namespaced capabilities checks (ns_capable() vs capable()) for CAP_BPF,
> CAP_PERFMON, CAP_NET_ADMIN, and CAP_SYS_ADMIN, as applicable. So it forms one
> half of a puzzle and allows container managers and sys admins to have safe and
> flexible configuration options: determining which containers get delegation of
> BPF functionality through BPF FS, and then which applications within such
> containers are allowed to perform bpf() commands, based on namespaces
> capabilities.
>
> Previous attempt at addressing this very same problem ([0]) attempted to
> utilize authoritative LSM approach, but was conclusively rejected by upstream
> LSM maintainers. BPF token concept is not changing anything about LSM
> approach, but can be combined with LSM hooks for very fine-grained security
> policy. Some ideas about making BPF token more convenient to use with LSM (in
> particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF
> 2023 presentation ([1]). E.g., an ability to specify user-provided data
> (context), which in combination with BPF LSM would allow implementing a very
> dynamic and fine-granular custom security policies on top of BPF token. In the
> interest of minimizing API surface area and discussions this was relegated to
> follow up patches, as it's not essential to the fundamental concept of
> delegatable BPF token.
>
> It should be noted that BPF token is conceptually quite similar to the idea of
> /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> difference is the idea of using virtual anon_inode file to hold BPF token and
> allowing multiple independent instances of them, each (potentially) with its
> own set of restrictions. And also, crucially, BPF token approach is not using
> any special stateful task-scoped flags. Instead, bpf() syscall accepts
> token_fd parameters explicitly for each relevant BPF command. This addresses
> main concerns brought up during the /dev/bpf discussion, and fits better with
> overall BPF subsystem design.
>
> This patch set adds a basic minimum of functionality to make BPF token idea
> useful and to discuss API and functionality. Currently only low-level libbpf
> APIs support creating and passing BPF token around, allowing to test kernel
> functionality, but for the most part is not sufficient for real-world
> applications, which typically use high-level libbpf APIs based on `struct
> bpf_object` type. This was done with the intent to limit the size of patch set
> and concentrate on mostly kernel-side changes. All the necessary plumbing for
> libbpf will be sent as a separate follow up patch set kernel support makes it
> upstream.
>
> Another part that should happen once kernel-side BPF token is established, is
> a set of conventions between applications (e.g., systemd), tools (e.g.,
> bpftool), and libraries (e.g., libbpf) on exposing delegatable BPF FS
> instance(s) at well-defined locations to allow applications take advantage of
> this in automatic fashion without explicit code changes on BPF application's
> side. But I'd like to postpone this discussion to after BPF token concept
> lands.
>
>   [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@xxxxxxxxxx/
>   [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf
>   [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@xxxxxx/
>   [3] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
>
> v7->v8:
>   - add bpf_token_allow_cmd and bpf_token_capable hooks (Paul);
>   - inline bpf_token_alloc() into bpf_token_create() to prevent accidental
>     divergence with security_bpf_token_create() hook (Paul);

Hi Paul,

I believe I addressed all the concerns you had in this revision. Can
you please take a look and confirm that all things look good to you
from LSM perspective? Thanks!

> v6->v7:
>   - separate patches to refactor bpf_prog_alloc/bpf_map_alloc LSM hooks, as
>     discussed with Paul, and now they also accept struct bpf_token;
>   - added bpf_token_create/bpf_token_free to allow LSMs (SELinux,
>     specifically) to set up security LSM blob (Paul);
>   - last patch also wires bpf_security_struct setup by SELinux, similar to how
>     it's done for BPF map/prog, though I'm not sure if that's enough, so worst
>     case it's easy to drop this patch if more full fledged SELinux
>     implementation will be done separately;
>   - small fixes for issues caught by code reviews (Jiri, Hou);
>   - fix for test_maps test that doesn't use LIBBPF_OPTS() macro (CI);
> v5->v6:
>   - fix possible use of uninitialized variable in selftests (CI);
>   - don't use anon_inode, instead create one from BPF FS instance (Christian);
>   - don't store bpf_token inside struct bpf_map, instead pass it explicitly to
>     map_check_btf(). We do store bpf_token inside prog->aux, because it's used
>     during verification and even can be checked during attach time for some
>     program types;
>   - LSM hooks are left intact pending the conclusion of discussion with Paul
>     Moore; I'd prefer to do LSM-related changes as a follow up patch set
>     anyways;
> v4->v5:
>   - add pre-patch unifying CAP_NET_ADMIN handling inside kernel/bpf/syscall.c
>     (Paul Moore);
>   - fix build warnings and errors in selftests and kernel, detected by CI and
>     kernel test robot;
> v3->v4:
>   - add delegation mount options to BPF FS;
>   - BPF token is derived from the instance of BPF FS and associates itself
>     with BPF FS' owning userns;
>   - BPF token doesn't grant BPF functionality directly, it just turns
>     capable() checks into ns_capable() checks within BPF FS' owning user;
>   - BPF token cannot be pinned;
> v2->v3:
>   - make BPF_TOKEN_CREATE pin created BPF token in BPF FS, and disallow
>     BPF_OBJ_PIN for BPF token;
> v1->v2:
>   - fix build failures on Kconfig with CONFIG_BPF_SYSCALL unset;
>   - drop BPF_F_TOKEN_UNKNOWN_* flags and simplify UAPI (Stanislav).
>
> Andrii Nakryiko (18):
>   bpf: align CAP_NET_ADMIN checks with bpf_capable() approach
>   bpf: add BPF token delegation mount options to BPF FS
>   bpf: introduce BPF token object
>   bpf: add BPF token support to BPF_MAP_CREATE command
>   bpf: add BPF token support to BPF_BTF_LOAD command
>   bpf: add BPF token support to BPF_PROG_LOAD command
>   bpf: take into account BPF token when fetching helper protos
>   bpf: consistenly use BPF token throughout BPF verifier logic
>   bpf,lsm: refactor bpf_prog_alloc/bpf_prog_free LSM hooks
>   bpf,lsm: refactor bpf_map_alloc/bpf_map_free LSM hooks
>   bpf,lsm: add BPF token LSM hooks
>   libbpf: add bpf_token_create() API
>   selftests/bpf: fix test_maps' use of bpf_map_create_opts
>   libbpf: add BPF token support to bpf_map_create() API
>   libbpf: add BPF token support to bpf_btf_load() API
>   libbpf: add BPF token support to bpf_prog_load() API
>   selftests/bpf: add BPF token-enabled tests
>   bpf,selinux: allocate bpf_security_struct per BPF token
>
>  drivers/media/rc/bpf-lirc.c                   |   2 +-
>  include/linux/bpf.h                           |  83 ++-
>  include/linux/filter.h                        |   2 +-
>  include/linux/lsm_hook_defs.h                 |  15 +-
>  include/linux/security.h                      |  43 +-
>  include/uapi/linux/bpf.h                      |  44 ++
>  kernel/bpf/Makefile                           |   2 +-
>  kernel/bpf/arraymap.c                         |   2 +-
>  kernel/bpf/bpf_lsm.c                          |  15 +-
>  kernel/bpf/cgroup.c                           |   6 +-
>  kernel/bpf/core.c                             |   3 +-
>  kernel/bpf/helpers.c                          |   6 +-
>  kernel/bpf/inode.c                            |  98 ++-
>  kernel/bpf/syscall.c                          | 215 ++++--
>  kernel/bpf/token.c                            | 247 +++++++
>  kernel/bpf/verifier.c                         |  13 +-
>  kernel/trace/bpf_trace.c                      |   2 +-
>  net/core/filter.c                             |  36 +-
>  net/ipv4/bpf_tcp_ca.c                         |   2 +-
>  net/netfilter/nf_bpf_link.c                   |   2 +-
>  security/security.c                           | 101 ++-
>  security/selinux/hooks.c                      |  47 +-
>  tools/include/uapi/linux/bpf.h                |  44 ++
>  tools/lib/bpf/bpf.c                           |  30 +-
>  tools/lib/bpf/bpf.h                           |  39 +-
>  tools/lib/bpf/libbpf.map                      |   1 +
>  .../bpf/map_tests/map_percpu_stats.c          |  20 +-
>  .../selftests/bpf/prog_tests/libbpf_probes.c  |   4 +
>  .../selftests/bpf/prog_tests/libbpf_str.c     |   6 +
>  .../testing/selftests/bpf/prog_tests/token.c  | 629 ++++++++++++++++++
>  30 files changed, 1577 insertions(+), 182 deletions(-)
>  create mode 100644 kernel/bpf/token.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/token.c
>
> --
> 2.34.1
>
>