On Thu, Jun 8, 2023 at 10:24 AM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
>
> On 06/07, Daniel Borkmann wrote:
> > This adds a generic layer called bpf_mprog which can be reused by different
> > attachment layers to enable multi-program attachment and dependency resolution.
> > In-kernel users of bpf_mprog don't need to care about the dependency
> > resolution internals; they can just consume it with a few API calls.
> >
> > The initial idea of having a generic API sparked out of discussion [0] on an
> > earlier revision of this work, where tc's priority was reused and exposed via
> > BPF uapi as a way to coordinate dependencies among tc BPF programs, similar
> > to how it is done for classic tc BPF. The feedback was that priority provides
> > a bad user experience and is hard to use [1], e.g.:
> >
> >   I cannot help but feel that priority logic copy-paste from old tc, netfilter
> >   and friends is done because "that's how things were done in the past". [...]
> >   Priority gets exposed everywhere in uapi all the way to bpftool when it's
> >   right there for users to understand. And that's the main problem with it.
> >
> >   The user doesn't want to and doesn't need to be aware of it, but uapi forces
> >   them to pick the priority. [...] Your cover letter [0] example proves that in
> >   real life different services pick the same priority. They simply don't know
> >   any better. Priority is an unnecessary magic that apps _have_ to pick, so
> >   they just copy-paste and everyone ends up using the same.
> >
> > The course of the discussion showed more and more the need for a generic,
> > reusable API where the "same look and feel" can be applied to various other
> > program types beyond just tc BPF. For example, XDP today does not have multi-
> > program support in the kernel, and there was also interest in this API for
> > improving management of cgroup program types. Such a common multi-program
> > management concept is useful for BPF management daemons or user-space BPF
> > applications coordinating their attachments.
> >
> > Both from the Cilium and Meta side [2], we've collected the following
> > requirements for a generic attach/detach/query API for multi-progs, which
> > have been implemented as part of this work:
> >
> >   - Support prog-based attach/detach and link API
> >   - Dependency directives (can also be combined):
> >     - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
> >       - BPF_F_ID flag as {fd,id} toggle
> >       - BPF_F_LINK flag as {prog,link} toggle
> >       - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and
> >         BPF_F_AFTER will just append for the case of attaching
> >       - Enforced only at attach time
> >     - BPF_F_{FIRST,LAST}
> >       - Enforced throughout the bpf_mprog state's lifetime
> >       - Admin override possible (e.g. link detach, prog-based BPF_F_REPLACE)
> >   - Internal revision counter, and optionally being able to pass expected_revision
> >     - User-space daemon can query current state with revision, and pass it
> >       along for attachment to assert current state before doing updates
> >   - Query also gets extension for link_ids array and link_attach_flags:
> >     - prog_ids are always filled with program IDs
> >     - link_ids are filled with link IDs when a link was used, otherwise 0
> >     - {prog,link}_attach_flags for holding {prog,link}-specific flags
> >   - Must be easy to integrate/reuse for in-kernel users
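To make the dependency directives above concrete, here is a rough user-space
sketch of an attach call under this proposal. Illustrative only: the field
names (relative_fd, expected_revision) and the flags follow the cover letter
and uapi diff, while the helper itself and its parameters are invented for
the example.

  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  /* Attach prog_fd so that it runs before anchor_prog_fd, but only if the
   * attachment chain is still at the revision we last queried. With
   * anchor_prog_fd == 0, BPF_F_BEFORE degenerates into a plain prepend.
   */
  static int attach_before(int target_fd, enum bpf_attach_type type,
                           int prog_fd, int anchor_prog_fd, __u64 revision)
  {
          union bpf_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.target_fd         = target_fd;
          attr.attach_type       = type;
          attr.attach_bpf_fd     = prog_fd;
          attr.attach_flags      = BPF_F_BEFORE;   /* combinable with BPF_F_ID/BPF_F_LINK */
          attr.relative_fd       = anchor_prog_fd; /* union member added by this patch */
          attr.expected_revision = revision;       /* mismatch fails the update */

          return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
  }

BPF_F_AFTER works symmetrically, and with BPF_F_ID set, relative_id would name
the anchor by ID rather than by fd.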
> >
> > The uapi-side changes needed for supporting bpf_mprog are rather minimal,
> > consisting of the addition of the attachment flags and the revision counter,
> > and the expansion of the existing union with a relative_{fd,id} member.
> >
> > The bpf_mprog framework consists of a bpf_mprog_entry object which holds
> > an array of bpf_mprog_fp (fast-path structures) and bpf_mprog_cp (control-path
> > structures). The two have been separated so that the fast path gets efficient
> > packing of bpf_prog pointers for maximum cache efficiency. Also, an array has
> > been chosen instead of a linked list or other structures to remove unnecessary
> > indirections for a fast point-to-entry in tc for BPF. The bpf_mprog_entry
> > comes as a pair via bpf_mprog_bundle so that in case of updates the peer
> > bpf_mprog_entry is populated and then just swapped, which avoids additional
> > allocations that could otherwise fail, for example, in the detach case. The
> > bpf_mprog_{fp,cp} arrays are currently static, but they could be converted
> > to dynamic allocation if necessary at some point in the future. Locking is
> > deferred to the in-kernel user of bpf_mprog; for example, tcx, which uses
> > this API in the next patch, piggybacks on rtnl. The nitty-gritty details are
> > in the bpf_mprog_{replace,head_tail,add,del} implementation, and an extensive
> > test suite checking all aspects of this API for prog-based attach/detach and
> > the link API is included as BPF selftests in this series.
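For orientation, the shape described in the previous paragraph is roughly the
following. This is a condensed kernel-side sketch based on the cover letter's
wording; the authoritative definitions are in include/linux/bpf_mprog.h of
this patch, and BPF_MPROG_MAX is a placeholder name for whatever static bound
the patch picks.

  #define BPF_MPROG_MAX 64                      /* placeholder static bound */

  struct bpf_mprog_fp {                         /* fast path: read on every prog invocation */
          struct bpf_prog *prog;
  };

  struct bpf_mprog_cp {                         /* control path: only touched on updates/queries */
          struct bpf_link *link;
  };

  struct bpf_mprog_entry {
          struct bpf_mprog_fp fp_items[BPF_MPROG_MAX]; /* densely packed for cache efficiency */
          struct bpf_mprog_cp cp_items[BPF_MPROG_MAX];
          struct bpf_mprog_bundle *parent;
  };

  /* Entries come in pairs: an update is staged into the peer entry, which is
   * then swapped in, so e.g. detach never needs an allocation that could fail.
   */
  struct bpf_mprog_bundle {
          struct bpf_mprog_entry a, b;
          u32 revision;                         /* bumped on each successful update */
  };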
> >
> > Kudos also to Andrii Nakryiko for API discussions wrt Meta's BPF management
> > daemon.
> >
> > [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@xxxxxxxxxxxxx/
> > [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@xxxxxxxxxxxxxx
> > [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
> >
> > Signed-off-by: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
> > ---
> >  MAINTAINERS                    |   1 +
> >  include/linux/bpf_mprog.h      | 245 +++++++++++++++++
> >  include/uapi/linux/bpf.h       |  37 ++-
> >  kernel/bpf/Makefile            |   2 +-
> >  kernel/bpf/mprog.c             | 476 +++++++++++++++++++++++++++++++++
> >  tools/include/uapi/linux/bpf.h |  37 ++-
> >  6 files changed, 781 insertions(+), 17 deletions(-)
> >  create mode 100644 include/linux/bpf_mprog.h
> >  create mode 100644 kernel/bpf/mprog.c
>
> [...]
>
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index a7b5e91dd768..207f8a37b327 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -1102,7 +1102,14 @@ enum bpf_link_type {
> >   */
> >  #define BPF_F_ALLOW_OVERRIDE (1U << 0)
> >  #define BPF_F_ALLOW_MULTI    (1U << 1)
> > +/* Generic attachment flags. */
> >  #define BPF_F_REPLACE        (1U << 2)
> > +#define BPF_F_BEFORE         (1U << 3)
> > +#define BPF_F_AFTER          (1U << 4)
>
> [..]
>
> > +#define BPF_F_FIRST          (1U << 5)
> > +#define BPF_F_LAST           (1U << 6)
>
> I'm still not sure whether the hard semantics of first/last is really
> useful. My worry is that some prog will just use BPF_F_FIRST, which
> would prevent the rest of the users.. (starting with only
> F_BEFORE/F_AFTER feels 'safer'; we can iterate later on if we really
> need first/last).

Without FIRST/LAST, some scenarios cannot be guaranteed to be safely
implemented. E.g., if I have some hard audit requirements and I need to
guarantee that my program runs first and observes each event, I'll enforce
BPF_F_FIRST when attaching it. And if that attachment fails, then the server
setup is broken and my application cannot function.

In a setup where we expect multiple applications to co-exist, it should be a
rule that no one uses FIRST/LAST (unless it's absolutely required). And if
someone doesn't comply, then that's a bug and has to be reported to the
application's owners. But it's not up to the kernel to enforce this
cooperation by disallowing FIRST/LAST semantics, because those semantics are
critical for some applications, IMO.

> But if everyone besides myself is on board with first/last, maybe at least
> put a comment here saying that only a single program can be first/last?
> And the users are advised not to use these unless they really, really,
> really need to be first/last. (IOW, feels like first/last should be
> reserved for observability tools/etc.)

+1, we can definitely make it clear in the API that this will prevent anyone
else from being attached as FIRST/LAST, so it's not cooperative in nature and
has to be very consciously evaluated.

> > +#define BPF_F_ID             (1U << 7)
> > +#define BPF_F_LINK           BPF_F_LINK /* 1 << 13 */
> >
> >  /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
> >   * verifier will perform strict alignment checking as if the kernel

[...]
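To spell out the exclusivity being debated above, the hard-requirement pattern
would look roughly like this. The helper names (load_audit_prog, attach_first)
are hypothetical stand-ins; attach_first() would issue BPF_PROG_ATTACH with
BPF_F_FIRST set and no relative anchor, and the exact errno on conflict is an
implementation detail that the selftests in this series pin down.

  /* Audit daemon with a hard "must observe every event" requirement. */
  int prog_fd = load_audit_prog();              /* hypothetical loader */

  if (attach_first(target_fd, prog_fd) < 0) {   /* hypothetical BPF_F_FIRST wrapper */
          /* Another application already claimed the first slot. Per the
           * discussion above, that is a deployment bug to raise with the
           * other application's owners, not something the kernel arbitrates.
           */
          fprintf(stderr, "audit prog cannot run first, setup is broken\n");
          exit(1);
  }
  /* From here on, no later BEFORE/FIRST attachment can get ahead of us for
   * the lifetime of the bpf_mprog state.
   */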