On Wed, Jun 7, 2023 at 12:27 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>
> This work refactors and adds a lightweight extension ("tcx") to the tc BPF
> ingress and egress data path side to allow BPF program management based
> on fds via the bpf() syscall through the newly added generic multi-prog API.
> The main goal behind this work, which we also presented at LPC [0] last year
> with a recent update at LSF/MM/BPF this year [3], is to support the
> long-awaited BPF link functionality for tc BPF programs, which allows for a
> model of safe ownership and program detachment.
>
> Given the rise in tc BPF users in cloud native environments, this becomes
> necessary to avoid hard-to-debug incidents caused either by stale leftover
> programs or by 3rd party applications accidentally stepping on each other's
> toes. As a recap, a BPF link represents the attachment of a BPF program to
> a BPF hook point. The BPF link holds a single reference to keep the BPF
> program alive. Moreover, hook points do not reference a BPF link; only the
> application's fd or pinning does. A BPF link holds meta-data specific to the
> attachment and implements operations for link creation, (atomic) BPF
> program update, detachment and introspection. The motivation for BPF links
> for tc BPF programs is multi-fold, for example:
>
> - From Meta: "It's especially important for applications that are deployed
>   fleet-wide and that don't "control" hosts they are deployed to. If such
>   application crashes and no one notices and does anything about that, BPF
>   program will keep running draining resources or even just, say, dropping
>   packets. We at FB had outages due to such permanent BPF attachment
>   semantics. With fd-based BPF link we are getting a framework, which allows
>   safe, auto-detachable behavior by default, unless application explicitly
>   opts in by pinning the BPF link."
>   [1]
>
> - From Cilium's side, the tc BPF programs we attach to host-facing veth
>   devices and phys devices build the core datapath for Kubernetes Pods, and
>   they implement forwarding, load-balancing, policy, EDT-management, etc,
>   within BPF. Currently there is no concept of 'safe' ownership; e.g. we've
>   recently experienced hard-to-debug issues in a user's staging environment
>   where another Kubernetes application using tc BPF attached to the same
>   prio/handle of cls_bpf accidentally wiped all Cilium-based BPF programs
>   out from underneath it. The goal is to establish a clear/safe ownership
>   model via links which cannot accidentally be overridden. [0,2]
>
> BPF links for tc can co-exist with non-link attachments, and the semantics
> are also in line with XDP links: BPF links cannot replace other BPF links,
> BPF links cannot replace non-BPF links, non-BPF links cannot replace BPF
> links, and lastly only non-BPF links can replace non-BPF links. In the case
> of Cilium, this would solve the mentioned issue of a safe ownership model,
> as 3rd party applications would not be able to accidentally wipe Cilium
> programs, even if they are not BPF link aware.
>
> Earlier attempts [4] have tried to integrate BPF links into the core tc
> machinery to solve cls_bpf, which has been intrusive to the generic tc
> kernel API with extensions only specific to cls_bpf, and suboptimal/complex
> since cls_bpf could also be wiped from the qdisc. Locking a tc BPF program
> in place this way gets into layering hacks given the two object models are
> vastly different.
>
> We instead implemented the tcx (tc 'express') layer, which is an fd-based
> tc BPF attach API, so that the BPF link implementation blends in naturally,
> similar to other link types which are fd-based, and without the need for
> changing core tc internal APIs.
> BPF programs for tc can then be successively migrated from classic cls_bpf
> to the new tc BPF link without needing to change the program's source code;
> adjusting the BPF loader mechanics for attaching is sufficient.
>
> For the current tc framework, there is no change in behavior with this
> change, and neither does this change touch tc core kernel APIs. The gist of
> this patch is that the ingress and egress hooks get a lightweight,
> qdisc-less extension for BPF to attach its tc BPF programs, in other words,
> a minimal entry point for tc BPF. The name tcx was suggested in discussion
> of earlier revisions of this work as a good fit, and to more easily
> differentiate between the classic cls_bpf attachment and the fd-based one.
>
> For the ingress and egress tcx points, the device holds a cache-friendly
> array with program pointers which is separated from control plane
> (slow-path) data. Earlier versions of this work used priority to determine
> ordering and expression of dependencies, similar to classic tc, but it was
> challenged that something more future-proof needs a better user experience.
> Hence this resulted in the design and development of the generic
> attach/detach/query API for multi-progs. See the prior patch with its
> discussion on the API design. tcx is the first user, and later we plan to
> integrate others as well; for example, one candidate is multi-prog support
> for XDP, which would benefit from and have the same 'look and feel' from an
> API perspective.
>
> The goal with tcx is to have maximum compatibility with existing tc BPF
> programs, so they don't need to be rewritten specifically. Compatibility to
> call into classic tcf_classify() is also provided in order to allow
> successive migration, or for both to cleanly co-exist where needed, given
> it's all one logical tc layer. tcx supports the simplified return codes
> TCX_NEXT, which is non-terminating (go to next program), and the
> terminating ones TCX_PASS, TCX_DROP and TCX_REDIRECT.
>
> The fd-based API is behind a static key, so that when unused the code is
> also not entered. The struct tcx_entry's program array is currently static,
> but could be made dynamic if necessary at some point in the future. The a/b
> pair swap design has been chosen so that for detachment there are no
> allocations which otherwise could fail. The work has been tested with the
> tc-testing selftest suite, which all passes, as well as with the tc BPF
> tests from the BPF CI, and also with Cilium's L4LB.
>
> Kudos also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
> of this work.
>
> [0] https://lpc.events/event/16/contributions/1353/
> [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@xxxxxxxxxxxxxx/
> [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
> [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
> [4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@xxxxxxxxx/
>
> Signed-off-by: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
> ---
>  MAINTAINERS                    |   4 +-
>  include/linux/netdevice.h      |  15 +-
>  include/linux/skbuff.h         |   4 +-
>  include/net/sch_generic.h      |   2 +-
>  include/net/tcx.h              | 157 +++++++++++++++
>  include/uapi/linux/bpf.h       |  35 +++-
>  kernel/bpf/Kconfig             |   1 +
>  kernel/bpf/Makefile            |   1 +
>  kernel/bpf/syscall.c           |  95 +++++++--
>  kernel/bpf/tcx.c               | 347 +++++++++++++++++++++++++++++++++
>  net/Kconfig                    |   5 +
>  net/core/dev.c                 | 267 +++++++++++++++----------
>  net/core/filter.c              |   4 +-
>  net/sched/Kconfig              |   4 +-
>  net/sched/sch_ingress.c        |  45 ++++-
>  tools/include/uapi/linux/bpf.h |  35 +++-
>  16 files changed, 877 insertions(+), 144 deletions(-)
>  create mode 100644 include/net/tcx.h
>  create mode 100644 kernel/bpf/tcx.c
> [...]
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 207f8a37b327..e7584e24bc83 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1035,6 +1035,8 @@ enum bpf_attach_type {
>  	BPF_TRACE_KPROBE_MULTI,
>  	BPF_LSM_CGROUP,
>  	BPF_STRUCT_OPS,
> +	BPF_TCX_INGRESS,
> +	BPF_TCX_EGRESS,
>  	__MAX_BPF_ATTACH_TYPE
>  };
>
> @@ -1052,7 +1054,7 @@ enum bpf_link_type {
>  	BPF_LINK_TYPE_KPROBE_MULTI = 8,
>  	BPF_LINK_TYPE_STRUCT_OPS = 9,
>  	BPF_LINK_TYPE_NETFILTER = 10,
> -
> +	BPF_LINK_TYPE_TCX = 11,
>  	MAX_BPF_LINK_TYPE,
>  };
>
> @@ -1559,13 +1561,13 @@ union bpf_attr {
>  			__u32		map_fd;		/* struct_ops to attach */
>  		};
>  		union {
> -			__u32	target_fd;	/* object to attach to */
> -			__u32	target_ifindex; /* target ifindex */
> +			__u32	target_fd;	/* target object to attach to or ... */
> +			__u32	target_ifindex; /* target ifindex */
>  		};
>  		__u32		attach_type;	/* attach type */
>  		__u32		flags;		/* extra flags */
>  		union {
> -			__u32	target_btf_id;	/* btf_id of target to attach to */
> +			__u32		target_btf_id;	/* btf_id of target to attach to */

nit: should this part be in patch 1?

>  			struct {
>  				__aligned_u64	iter_info;	/* extra bpf_iter_link_info */
>  				__u32		iter_info_len;	/* iter_info length */
> @@ -1599,6 +1601,13 @@ union bpf_attr {
>  				__s32		priority;
>  				__u32		flags;
>  			} netfilter;
> +			struct {
> +				union {
> +					__u32	relative_fd;
> +					__u32	relative_id;
> +				};
> +				__u32	expected_revision;
> +			} tcx;
>  		};
>  	} link_create;
> [...]
> +int tcx_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> +	struct net *net = current->nsproxy->net_ns;
> +	struct bpf_link_primer link_primer;
> +	struct net_device *dev;
> +	struct tcx_link *link;
> +	int fd, err;
> +
> +	dev = dev_get_by_index(net, attr->link_create.target_ifindex);
> +	if (!dev)
> +		return -EINVAL;
> +	link = kzalloc(sizeof(*link), GFP_USER);
> +	if (!link) {
> +		err = -ENOMEM;
> +		goto out_put;
> +	}
> +
> +	bpf_link_init(&link->link, BPF_LINK_TYPE_TCX, &tcx_link_lops, prog);
> +	link->location = attr->link_create.attach_type;
> +	link->flags = attr->link_create.flags & (BPF_F_FIRST | BPF_F_LAST);
> +	link->dev = dev;
> +
> +	err = bpf_link_prime(&link->link, &link_primer);
> +	if (err) {
> +		kfree(link);
> +		goto out_put;
> +	}
> +	rtnl_lock();
> +	err = tcx_link_prog_attach(&link->link, attr->link_create.flags,
> +				   attr->link_create.tcx.relative_fd,
> +				   attr->link_create.tcx.expected_revision);
> +	if (!err)
> +		fd = bpf_link_settle(&link_primer);

Why this early settle? It makes the error handling logic more convoluted.
Maybe leave link->dev as is and let bpf_link_cleanup() handle dev_put(dev)?
Can it be just:

	err = tcx_link_prog_attach(...);
	rtnl_unlock();
	if (err) {
		link->dev = NULL;
		bpf_link_cleanup(&link_primer);
		goto out_put;
	}
	dev_put(dev);
	return bpf_link_settle(&link_primer);

?

> +	rtnl_unlock();
> +	if (err) {
> +		link->dev = NULL;
> +		bpf_link_cleanup(&link_primer);
> +		goto out_put;
> +	}
> +	dev_put(dev);
> +	return fd;
> +out_put:
> +	dev_put(dev);
> +	return err;
> +}

[...]