On 07/14, Daniel Borkmann wrote:
> This adds a generic layer called bpf_mprog which can be reused by different
> attachment layers to enable multi-program attachment and dependency resolution.
> In-kernel users of bpf_mprog don't need to care about the dependency
> resolution internals, they can just consume it with a few API calls.
>
> The initial idea of having a generic API sparked out of discussion [0] from an
> earlier revision of this work where tc's priority was reused and exposed via
> BPF uapi as a way to coordinate dependencies among tc BPF programs, similar
> to how it is done for classic tc BPF. The feedback was that priority provides
> a bad user experience and is hard to use [1], e.g.:
>
>   I cannot help but feel that priority logic copy-paste from old tc, netfilter
>   and friends is done because "that's how things were done in the past". [...]
>   Priority gets exposed everywhere in uapi all the way to bpftool when it's
>   right there for users to understand. And that's the main problem with it.
>
>   The user don't want to and don't need to be aware of it, but uapi forces them
>   to pick the priority. [...] Your cover letter [0] example proves that in
>   real life different service pick the same priority. They simply don't know
>   any better. Priority is an unnecessary magic that apps _have_ to pick, so
>   they just copy-paste and everyone ends up using the same.
>
> The course of the discussion showed more and more the need for a generic,
> reusable API where the "same look and feel" can be applied to various other
> program types beyond just tc BPF; for example, XDP today does not have
> multi-program support in the kernel, and there was also interest in this API
> for improving management of cgroup program types. Such a common multi-program
> management concept is useful for BPF management daemons or user space BPF
> applications coordinating internally about their attachments.
>
> Both from Cilium and Meta side [2], we've collected the following requirements
> for a generic attach/detach/query API for multi-progs, which has been implemented
> as part of this work:
>
>   - Support prog-based attach/detach and link API
>   - Dependency directives (can also be combined):
>     - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
>     - BPF_F_ID flag as {fd,id} toggle; the rationale for id is so that user
>       space application does not need CAP_SYS_ADMIN to retrieve foreign fds
>       via bpf_*_get_fd_by_id()
>     - BPF_F_LINK flag as {prog,link} toggle
>     - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and
>       BPF_F_AFTER will just append for attaching
>     - Enforced only at attach time
>   - BPF_F_REPLACE with replace_bpf_fd which can be prog; links have their
>     own infra for replacing their internal prog
>   - If no flags are set, then it's default append behavior for attaching
>   - Internal revision counter and optionally being able to pass expected_revision:
>     - User space application can query current state with revision, and pass it
>       along for attachment to assert current state before doing updates
>   - Query also gets extension for link_ids array and link_attach_flags:
>     - prog_ids are always filled with program IDs
>     - link_ids are filled with link IDs when link was used, otherwise 0
>     - {prog,link}_attach_flags for holding {prog,link}-specific flags
>   - Must be easy to integrate/reuse for in-kernel users
>
> The uapi-side changes needed for supporting bpf_mprog are rather minimal,
> consisting of the additions of the attachment flags, the revision counter, and
> expanding the existing union with the relative_{fd,id} member.
>
> The bpf_mprog framework consists of a bpf_mprog_entry object which holds
> an array of bpf_mprog_fp (fast-path structure). The bpf_mprog_cp (control-path
> structure) is part of bpf_mprog_bundle. Both have been separated, so that
> the fast path gets efficient packing of bpf_prog pointers for maximum cache
> efficiency. Also, an array has been chosen instead of a linked list or other
> structures to remove unnecessary indirections for a fast point-to-entry in
> tc for BPF.
>
> The bpf_mprog_entry comes as a pair via bpf_mprog_bundle so that in case of
> updates the peer bpf_mprog_entry is populated and then just swapped, which
> avoids additional allocations that could otherwise fail, for example, in the
> detach case. bpf_mprog_{fp,cp} arrays are currently static, but they could
> be converted to dynamic allocation if necessary at a point in future.
> Locking is deferred to the in-kernel user of bpf_mprog; for example, in the
> case of tcx, which uses this API in the next patch, it piggybacks on rtnl.
>
> An extensive test suite for checking all aspects of this API for prog-based
> attach/detach and link API comes as BPF selftests in this series.
>
> Thanks also to Andrii Nakryiko for early API discussions wrt Meta's BPF prog
> management.
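Side note, not a request for changes: just to double-check my reading of the
directives above, here is a rough, untested user-space sketch against the uapi
additions in this patch. ifindex, prog_fd, other_fd and revision are
placeholders, and BPF_TCX_INGRESS is assumed from the tcx patch later in the
series:

        #include <unistd.h>
        #include <sys/syscall.h>
        #include <linux/bpf.h>

        /* Attach prog_fd so it runs before other_fd on the given ifindex;
         * a non-zero revision asserts the state seen by a prior BPF_PROG_QUERY.
         */
        static int attach_before(int ifindex, int prog_fd, int other_fd, __u64 revision)
        {
                union bpf_attr attr = {};

                attr.target_ifindex    = ifindex;         /* attach target, e.g. a netdev for tcx */
                attr.attach_bpf_fd     = prog_fd;         /* program to attach */
                attr.attach_type       = BPF_TCX_INGRESS; /* assumed, from the next patch */
                attr.attach_flags      = BPF_F_BEFORE;    /* insert in front of other_fd ... */
                attr.relative_fd       = other_fd;        /* ... or pass an id together with BPF_F_ID */
                attr.expected_revision = revision;        /* optional */

                return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
        }

And with no BPF_F_{BEFORE,AFTER} and no relative_{fd,id} this degenerates to
the plain append behavior described above, right?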
>
> [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@xxxxxxxxxxxxx
> [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@xxxxxxxxxxxxxx
> [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
>
> Signed-off-by: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
> ---
> MAINTAINERS | 1 +
> include/linux/bpf_mprog.h | 318 ++++++++++++++++++++++++
> include/uapi/linux/bpf.h | 36 ++-
> kernel/bpf/Makefile | 2 +-
> kernel/bpf/mprog.c | 439 +++++++++++++++++++++++++++++++++
> tools/include/uapi/linux/bpf.h | 36 ++-
> 6 files changed, 815 insertions(+), 17 deletions(-)
> create mode 100644 include/linux/bpf_mprog.h
> create mode 100644 kernel/bpf/mprog.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index d1fc5d26c699..4f3b8139c6ff 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3684,6 +3684,7 @@ F: include/linux/filter.h
>  F: include/linux/tnum.h
>  F: kernel/bpf/core.c
>  F: kernel/bpf/dispatcher.c
> +F: kernel/bpf/mprog.c
>  F: kernel/bpf/syscall.c
>  F: kernel/bpf/tnum.c
>  F: kernel/bpf/trampoline.c
> diff --git a/include/linux/bpf_mprog.h b/include/linux/bpf_mprog.h
> new file mode 100644
> index 000000000000..6feefec43422
> --- /dev/null
> +++ b/include/linux/bpf_mprog.h
> @@ -0,0 +1,318 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c) 2023 Isovalent */
> +#ifndef __BPF_MPROG_H
> +#define __BPF_MPROG_H
> +
> +#include <linux/bpf.h>
> +
> +/* bpf_mprog framework:
> + *
> + * bpf_mprog is a generic layer for multi-program attachment. In-kernel users
> + * of the bpf_mprog don't need to care about the dependency resolution
> + * internals, they can just consume it with few API calls. Currently available
> + * dependency directives are BPF_F_{BEFORE,AFTER} which enable insertion of
> + * a BPF program or BPF link relative to an existing BPF program or BPF link
> + * inside the multi-program array as well as prepend and append behavior if
> + * no relative object was specified, see corresponding selftests for concrete
> + * examples (e.g. tc_links and tc_opts test cases of test_progs).
> + *
> + * Usage of bpf_mprog_{attach,detach,query}() core APIs with pseudo code:
> + *
> + * Attach case:
> + *
> + *  struct bpf_mprog_entry *entry, *entry_new;
> + *  int ret;
> + *
> + *  // bpf_mprog user-side lock
> + *  // fetch active @entry from attach location
> + *  [...]
> + *  ret = bpf_mprog_attach(entry, &entry_new, [...]);
> + *  if (!ret) {
> + *          if (entry != entry_new) {
> + *                  // swap @entry to @entry_new at attach location
> + *                  // ensure there are no inflight users of @entry:
> + *                  synchronize_rcu();
> + *          }
> + *          bpf_mprog_commit(entry);
> + *  } else {
> + *          // error path, bail out, propagate @ret
> + *  }
> + *  // bpf_mprog user-side unlock
> + *
> + * Detach case:
> + *
> + *  struct bpf_mprog_entry *entry, *entry_new;
> + *  int ret;
> + *
> + *  // bpf_mprog user-side lock
> + *  // fetch active @entry from attach location
> + *  [...]
> + *  ret = bpf_mprog_detach(entry, &entry_new, [...]);
> + *  if (!ret) {
> + *          // all (*) marked is optional and depends on the use-case
> + *          // whether bpf_mprog_bundle should be freed or not
> + *          if (!bpf_mprog_total(entry_new))    (*)
> + *                  entry_new = NULL            (*)
> + *          // swap @entry to @entry_new at attach location
> + *          // ensure there are no inflight users of @entry:
> + *          synchronize_rcu();
> + *          bpf_mprog_commit(entry);
> + *          if (!entry_new)                     (*)
> + *                  // free bpf_mprog_bundle    (*)
> + *  } else {
> + *          // error path, bail out, propagate @ret
> + *  }
> + *  // bpf_mprog user-side unlock
> + *
> + * Query case:
> + *
> + *  struct bpf_mprog_entry *entry;
> + *  int ret;
> + *
> + *  // bpf_mprog user-side lock
> + *  // fetch active @entry from attach location
> + *  [...]
> + *  ret = bpf_mprog_query(attr, uattr, entry);
> + *  // bpf_mprog user-side unlock
> + *
> + * Data/fast path:
> + *
> + *  struct bpf_mprog_entry *entry;
> + *  struct bpf_mprog_fp *fp;
> + *  struct bpf_prog *prog;
> + *  int ret = [...];
> + *
> + *  rcu_read_lock();
> + *  // fetch active @entry from attach location
> + *  [...]
> + *  bpf_mprog_foreach_prog(entry, fp, prog) {
> + *          ret = bpf_prog_run(prog, [...]);
> + *          // process @ret from program
> + *  }
> + *  [...]
> + *  rcu_read_unlock();
> + *
> + * bpf_mprog locking considerations:
> + *
> + * bpf_mprog_{attach,detach,query}() must be protected by an external lock
> + * (like RTNL in case of tcx).
> + *
> + * bpf_mprog_entry pointer can be an __rcu annotated pointer (in case of tcx
> + * the netdevice has tcx_ingress and tcx_egress __rcu pointer) which gets
> + * updated via rcu_assign_pointer() pointing to the active bpf_mprog_entry of
> + * the bpf_mprog_bundle.
> + *
> + * Fast path accesses the active bpf_mprog_entry within RCU critical section
> + * (in case of tcx it runs in NAPI which provides RCU protection there,
> + * other users might need explicit rcu_read_lock()). The bpf_mprog_commit()
> + * assumes that for the old bpf_mprog_entry there are no inflight users
> + * anymore.
> + *
> + * The READ_ONCE()/WRITE_ONCE() pairing for bpf_mprog_fp's prog access is for
> + * the replacement case where we don't swap the bpf_mprog_entry.
> + */
> +
> +#define bpf_mprog_foreach_tuple(entry, fp, cp, t) \
> +        for (fp = &entry->fp_items[0], cp = &entry->parent->cp_items[0]; \
> +             ({ \
> +                t.prog = READ_ONCE(fp->prog); \
> +                t.link = cp->link; \
> +                t.prog; \
> +             }); \
> +             fp++, cp++)
> +
> +#define bpf_mprog_foreach_prog(entry, fp, p) \
> +        for (fp = &entry->fp_items[0]; \
> +             (p = READ_ONCE(fp->prog)); \
> +             fp++)
> +
> +#define BPF_MPROG_MAX 64
> +
> +struct bpf_mprog_fp {
> +        struct bpf_prog *prog;
> +};
> +
> +struct bpf_mprog_cp {
> +        struct bpf_link *link;
> +};
> +
> +struct bpf_mprog_entry {
> +        struct bpf_mprog_fp fp_items[BPF_MPROG_MAX];
> +        struct bpf_mprog_bundle *parent;
> +};
> +
> +struct bpf_mprog_bundle {
> +        struct bpf_mprog_entry a;
> +        struct bpf_mprog_entry b;
> +        struct bpf_mprog_cp cp_items[BPF_MPROG_MAX];
> +        struct bpf_prog *ref;
> +        atomic64_t revision;
> +        u32 count;
> +};
> +
> +struct bpf_tuple {
> +        struct bpf_prog *prog;
> +        struct bpf_link *link;
> +};
> +
> +static inline struct bpf_mprog_entry *
> +bpf_mprog_peer(const struct bpf_mprog_entry *entry)
> +{
> +        if (entry == &entry->parent->a)
> +                return &entry->parent->b;
> +        else
> +                return &entry->parent->a;
> +}
> +
> +static inline void bpf_mprog_bundle_init(struct bpf_mprog_bundle *bundle)
> +{
> +        BUILD_BUG_ON(sizeof(bundle->a.fp_items[0]) > sizeof(u64));
> +        BUILD_BUG_ON(ARRAY_SIZE(bundle->a.fp_items) !=
> +                     ARRAY_SIZE(bundle->cp_items));
> +
> +        memset(bundle, 0, sizeof(*bundle));
> +        atomic64_set(&bundle->revision, 1);
> +        bundle->a.parent = bundle;
> +        bundle->b.parent = bundle;
> +}
> +
> +static inline void bpf_mprog_inc(struct bpf_mprog_entry *entry)
> +{
> +        entry->parent->count++;
> +}
> +
> +static inline void bpf_mprog_dec(struct bpf_mprog_entry *entry)
> +{
> +        entry->parent->count--;
> +}
> +
> +static inline int bpf_mprog_max(void)
> +{
> +        return ARRAY_SIZE(((struct bpf_mprog_entry *)NULL)->fp_items) - 1;
> +}
> +
> +static inline int bpf_mprog_total(struct bpf_mprog_entry *entry)
> +{
> +        int total = entry->parent->count;
> +
> +        WARN_ON_ONCE(total > bpf_mprog_max());
> +        return total;
> +}
> +
> +static inline bool bpf_mprog_exists(struct bpf_mprog_entry *entry,
> +                                    struct bpf_prog *prog)
> +{
> +        const struct bpf_mprog_fp *fp;
> +        const struct bpf_prog *tmp;
> +
> +        bpf_mprog_foreach_prog(entry, fp, tmp) {
> +                if (tmp == prog)
> +                        return true;
> +        }
> +        return false;
> +}
> +
> +static inline void bpf_mprog_mark_for_release(struct bpf_mprog_entry *entry,
> +                                              struct bpf_tuple *tuple)
> +{
> +        WARN_ON_ONCE(entry->parent->ref);
> +        if (!tuple->link)
> +                entry->parent->ref = tuple->prog;
> +}
> +
> +static inline void bpf_mprog_complete_release(struct bpf_mprog_entry *entry)
> +{
> +        /* In the non-link case prog deletions can only drop the reference
> +         * to the prog after the bpf_mprog_entry got swapped and the
> +         * bpf_mprog ensured that there are no inflight users anymore.
> +         *
> +         * Paired with bpf_mprog_mark_for_release().
> +         */
> +        if (entry->parent->ref) {
> +                bpf_prog_put(entry->parent->ref);
> +                entry->parent->ref = NULL;
> +        }
> +}
> +
> +static inline void bpf_mprog_revision_new(struct bpf_mprog_entry *entry)
> +{
> +        atomic64_inc(&entry->parent->revision);
> +}
> +
> +static inline void bpf_mprog_commit(struct bpf_mprog_entry *entry)
> +{
> +        bpf_mprog_complete_release(entry);
> +        bpf_mprog_revision_new(entry);
> +}
> +
> +static inline u64 bpf_mprog_revision(struct bpf_mprog_entry *entry)
> +{
> +        return atomic64_read(&entry->parent->revision);
> +}
> +
> +static inline void bpf_mprog_entry_copy(struct bpf_mprog_entry *dst,
> +                                        struct bpf_mprog_entry *src)
> +{
> +        memcpy(dst->fp_items, src->fp_items, sizeof(src->fp_items));
> +}
> +
> +static inline void bpf_mprog_entry_grow(struct bpf_mprog_entry *entry, int idx)
> +{
> +        int total = bpf_mprog_total(entry);
> +
> +        memmove(entry->fp_items + idx + 1,
> +                entry->fp_items + idx,
> +                (total - idx) * sizeof(struct bpf_mprog_fp));
> +
> +        memmove(entry->parent->cp_items + idx + 1,
> +                entry->parent->cp_items + idx,
> +                (total - idx) * sizeof(struct bpf_mprog_cp));
> +}
> +
> +static inline void bpf_mprog_entry_shrink(struct bpf_mprog_entry *entry, int idx)
> +{
> +        /* Total array size is needed in this case to ensure the NULL
> +         * entry is copied at the end.
> +         */
> +        int total = ARRAY_SIZE(entry->fp_items);
> +
> +        memmove(entry->fp_items + idx,
> +                entry->fp_items + idx + 1,
> +                (total - idx - 1) * sizeof(struct bpf_mprog_fp));
> +
> +        memmove(entry->parent->cp_items + idx,
> +                entry->parent->cp_items + idx + 1,
> +                (total - idx - 1) * sizeof(struct bpf_mprog_cp));
> +}
> +
> +static inline void bpf_mprog_read(struct bpf_mprog_entry *entry, u32 idx,
> +                                  struct bpf_mprog_fp **fp,
> +                                  struct bpf_mprog_cp **cp)
> +{
> +        *fp = &entry->fp_items[idx];
> +        *cp = &entry->parent->cp_items[idx];
> +}
> +
> +static inline void bpf_mprog_write(struct bpf_mprog_fp *fp,
> +                                   struct bpf_mprog_cp *cp,
> +                                   struct bpf_tuple *tuple)
> +{
> +        WRITE_ONCE(fp->prog, tuple->prog);
> +        cp->link = tuple->link;
> +}
> +
> +int bpf_mprog_attach(struct bpf_mprog_entry *entry,
> +                     struct bpf_mprog_entry **entry_new,
> +                     struct bpf_prog *prog_new, struct bpf_link *link,
> +                     struct bpf_prog *prog_old,
> +                     u32 flags, u32 id_or_fd, u64 revision);
> +
> +int bpf_mprog_detach(struct bpf_mprog_entry *entry,
> +                     struct bpf_mprog_entry **entry_new,
> +                     struct bpf_prog *prog, struct bpf_link *link,
> +                     u32 flags, u32 id_or_fd, u64 revision);
> +
> +int bpf_mprog_query(const union bpf_attr *attr, union bpf_attr __user *uattr,
> +                    struct bpf_mprog_entry *entry);
> +
> +#endif /* __BPF_MPROG_H */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 600d0caebbd8..1c166870cdf3 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1113,7 +1113,12 @@ enum bpf_perf_event_type {
>   */
>  #define BPF_F_ALLOW_OVERRIDE   (1U << 0)
>  #define BPF_F_ALLOW_MULTI      (1U << 1)
> +/* Generic attachment flags. */
>  #define BPF_F_REPLACE          (1U << 2)
> +#define BPF_F_BEFORE           (1U << 3)
> +#define BPF_F_AFTER            (1U << 4)
> +#define BPF_F_ID               (1U << 5)
> +#define BPF_F_LINK             BPF_F_LINK /* 1 << 13 */
>
>  /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
>   * verifier will perform strict alignment checking as if the kernel
> @@ -1444,14 +1449,19 @@ union bpf_attr {
>          };
>
>          struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
> -                __u32 target_fd;      /* container object to attach to */
> -                __u32 attach_bpf_fd;  /* eBPF program to attach */
> +                union {
> +                        __u32 target_fd;      /* target object to attach to or ... */
> +                        __u32 target_ifindex; /* target ifindex */
> +                };
> +                __u32 attach_bpf_fd;
>                  __u32 attach_type;
>                  __u32 attach_flags;
> -                __u32 replace_bpf_fd; /* previously attached eBPF
> -                                       * program to replace if
> -                                       * BPF_F_REPLACE is used
> -                                       */
> +                __u32 replace_bpf_fd;
> +                union {
> +                        __u32 relative_fd;
> +                        __u32 relative_id;
> +                };
> +                __u64 expected_revision;
>          };
>
>          struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
> @@ -1497,16 +1507,26 @@ union bpf_attr {
>          } info;
>
>          struct { /* anonymous struct used by BPF_PROG_QUERY command */
> -                __u32 target_fd;      /* container object to query */
> +                union {
> +                        __u32 target_fd;      /* target object to query or ... */
> +                        __u32 target_ifindex; /* target ifindex */
> +                };
>                  __u32 attach_type;
>                  __u32 query_flags;
>                  __u32 attach_flags;
>                  __aligned_u64 prog_ids;
> -                __u32 prog_cnt;
> +                union {
> +                        __u32 prog_cnt;
> +                        __u32 count;
> +                };
> +                __u32 :32;
>                  /* output: per-program attach_flags.
>                   * not allowed to be set during effective query.
>                   */
>                  __aligned_u64 prog_attach_flags;
> +                __aligned_u64 link_ids;
> +                __aligned_u64 link_attach_flags;
> +                __u64 revision;
>          } query;
>
>          struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 1d3892168d32..1bea2eb912cd 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -12,7 +12,7 @@ obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list
>  obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
>  obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
>  obj-${CONFIG_BPF_LSM} += bpf_inode_storage.o
> -obj-$(CONFIG_BPF_SYSCALL) += disasm.o
> +obj-$(CONFIG_BPF_SYSCALL) += disasm.o mprog.o
>  obj-$(CONFIG_BPF_JIT) += trampoline.o
>  obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o
>  obj-$(CONFIG_BPF_JIT) += dispatcher.o
> diff --git a/kernel/bpf/mprog.c b/kernel/bpf/mprog.c
> new file mode 100644
> index 000000000000..f6c0b015114f
> --- /dev/null
> +++ b/kernel/bpf/mprog.c
> @@ -0,0 +1,439 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2023 Isovalent */
> +
> +#include <linux/bpf.h>
> +#include <linux/bpf_mprog.h>
> +
> +static int bpf_mprog_link(struct bpf_tuple *tuple,
> +                          u32 id_or_fd, u32 flags,
> +                          enum bpf_prog_type type)
> +{
> +        bool id = flags & BPF_F_ID;
> +        struct bpf_link *link;
> +
> +        if (id)
> +                link = bpf_link_by_id(id_or_fd);
> +        else
> +                link = bpf_link_get_from_fd(id_or_fd);
> +        if (IS_ERR(link))
> +                return PTR_ERR(link);
> +        if (type && link->prog->type != type) {
> +                bpf_link_put(link);
> +                return -EINVAL;
> +        }
> +
> +        tuple->link = link;
> +        tuple->prog = link->prog;
> +        return 0;
> +}
> +
> +static int bpf_mprog_prog(struct bpf_tuple *tuple,
> +                          u32 id_or_fd, u32 flags,
> +                          enum bpf_prog_type type)
> +{
> +        bool id = flags & BPF_F_ID;
> +        struct bpf_prog *prog;
> +
> +        if (id)
> +                prog = bpf_prog_by_id(id_or_fd);
> +        else
> +                prog = bpf_prog_get(id_or_fd);
> +        if (IS_ERR(prog)) {
> +                if (!id_or_fd && !id)
> +                        return 0;

Andrii was asking whether we need to explicitly reject zero in [0], and we've
been chatting with Alexei about the same in [1]. So are we trying to be more
flexible here on purpose, or should we outright return -EINVAL for a zero
id_or_fd? (Or am I misreading this part? A rough sketch of what I mean is
below the links.)

The rest looks great, thanks for the docs indeed!

0: https://lore.kernel.org/bpf/CAEf4Bza_X30yLPm0Lhy2c-u1Qw1Ci9AVoy5jo_XXCaT9zz+3jg@xxxxxxxxxxxxxx/
1: https://lore.kernel.org/bpf/CAKH8qBsr5vYijQSVv0EO8TF7zfoAdAaWC8jpVKK_nGSgAoyiQg@xxxxxxxxxxxxxx/#t
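To illustrate (completely untested, and only if strictness is the intent): an
explicit rejection in bpf_mprog_prog() could look roughly like the below, with
the same treatment for the link counterpart; the tail of the function is my
guess, since the quote above is trimmed:

        static int bpf_mprog_prog(struct bpf_tuple *tuple,
                                  u32 id_or_fd, u32 flags,
                                  enum bpf_prog_type type)
        {
                bool id = flags & BPF_F_ID;
                struct bpf_prog *prog;

                /* Reject a zero id/fd up front instead of silently treating it
                 * as "no relative program". Callers that want plain prepend or
                 * append would then pass BPF_F_{BEFORE,AFTER} without any
                 * relative_{fd,id} at all.
                 */
                if (!id_or_fd)
                        return -EINVAL;

                prog = id ? bpf_prog_by_id(id_or_fd) : bpf_prog_get(id_or_fd);
                if (IS_ERR(prog))
                        return PTR_ERR(prog);
                if (type && prog->type != type) {
                        bpf_prog_put(prog);
                        return -EINVAL;
                }

                tuple->link = NULL;
                tuple->prog = prog;
                return 0;
        }

That said, if BPF_F_BEFORE/BPF_F_AFTER with a zero relative_{fd,id} is meant to
stay a valid alias for plain prepend/append (as the cover letter suggests), then
the current behavior makes sense and maybe just deserves a short comment.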