On Wed, Jun 7, 2023 at 12:27 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>
> This work refactors and adds a lightweight extension ("tcx") to the tc BPF
> ingress and egress data path side to allow BPF program management based
> on fds via the bpf() syscall through the newly added generic multi-prog API.
> The main goal behind this work, which we also presented at LPC [0] last year
> with a recent update at LSF/MM/BPF this year [3], is to support the
> long-awaited BPF link functionality for tc BPF programs, which allows for a
> model of safe ownership and program detachment.
>
> Given the rise in tc BPF users in cloud native environments, this becomes
> necessary to avoid hard-to-debug incidents caused either by stale leftover
> programs or by 3rd party applications accidentally stepping on each other's
> toes. As a recap, a BPF link represents the attachment of a BPF program to
> a BPF hook point. The BPF link holds a single reference to keep the BPF
> program alive. Moreover, hook points do not reference a BPF link; only the
> application's fd or pinning does. A BPF link holds meta-data specific to the
> attachment and implements operations for link creation, (atomic) BPF
> program update, detachment and introspection. The motivation for BPF links
> for tc BPF programs is multi-fold, for example:
>
> - From Meta: "It's especially important for applications that are deployed
>   fleet-wide and that don't "control" hosts they are deployed to. If such
>   application crashes and no one notices and does anything about that, BPF
>   program will keep running draining resources or even just, say, dropping
>   packets. We at FB had outages due to such permanent BPF attachment
>   semantics. With fd-based BPF link we are getting a framework, which allows
>   safe, auto-detachable behavior by default, unless application explicitly
>   opts in by pinning the BPF link."
>   [1]
>
> - From Cilium's side, the tc BPF programs we attach to host-facing veth
>   devices and phys devices build the core datapath for Kubernetes Pods, and
>   they implement forwarding, load-balancing, policy, EDT-management, etc,
>   within BPF. Currently there is no concept of 'safe' ownership; e.g. we've
>   recently experienced hard-to-debug issues in a user's staging environment
>   where another Kubernetes application using tc BPF attached to the same
>   prio/handle of cls_bpf accidentally wiped all Cilium-based BPF programs
>   out from underneath it. The goal is to establish a clear/safe ownership
>   model via links which cannot accidentally be overridden. [0,2]
>
> BPF links for tc can co-exist with non-link attachments, and the semantics
> are also in line with XDP links: BPF links cannot replace other BPF links,
> BPF links cannot replace non-BPF links, non-BPF links cannot replace BPF
> links, and lastly only non-BPF links can replace non-BPF links. In the case
> of Cilium, this would solve the mentioned issue of a safe ownership model,
> as 3rd party applications would not be able to accidentally wipe Cilium
> programs, even if they are not BPF link aware.
>
> Earlier attempts [4] have tried to integrate BPF links into the core tc
> machinery to solve cls_bpf, which has been intrusive to the generic tc
> kernel API with extensions only specific to cls_bpf, and suboptimal/complex
> since cls_bpf could also be wiped from the qdisc. Locking a tc BPF program
> in place this way gets into layering hacks given the two object models are
> vastly different.
>
> We instead implemented the tcx (tc 'express') layer, which is an fd-based
> tc BPF attach API, so that the BPF link implementation blends in naturally,
> similar to other link types which are fd-based, and without the need for
> changing core tc internal APIs.
> BPF programs for tc can then be successively migrated from classic cls_bpf
> to the new tc BPF link without needing to change the program's source code;
> adjusting the BPF loader mechanics for attaching is sufficient.
>
> For the current tc framework, there is no change in behavior with this
> change, and neither does this change touch tc core kernel APIs. The gist of
> this patch is that the ingress and egress hooks get a lightweight,
> qdisc-less extension for BPF to attach its tc BPF programs, in other words,
> a minimal entry point for tc BPF. The name tcx was suggested in discussion
> of earlier revisions of this work as a good fit, and to more easily
> differentiate between the classic cls_bpf attachment and the fd-based one.
>
> For the ingress and egress tcx points, the device holds a cache-friendly
> array with program pointers which is separated from control plane
> (slow-path) data. Earlier versions of this work used priority to determine
> ordering and expression of dependencies, similar to classic tc, but it was
> challenged that something more future-proof needs a better user experience.
> Hence this resulted in the design and development of the generic
> attach/detach/query API for multi-progs. See the prior patch with its
> discussion on the API design. tcx is the first user, and later we plan to
> integrate others as well; for example, one candidate is multi-prog support
> for XDP, which would benefit from and have the same 'look and feel' from an
> API perspective.
>
> The goal with tcx is to have maximum compatibility with existing tc BPF
> programs, so they don't need to be rewritten specifically. Compatibility to
> call into classic tcf_classify() is also provided in order to allow
> successive migration, or for both to cleanly co-exist where needed, given
> it's all one logical tc layer. tcx supports the simplified return codes
> TCX_NEXT, which is non-terminating (go to next program), and the
> terminating ones TCX_PASS, TCX_DROP and TCX_REDIRECT.
>
> The fd-based API is behind a static key, so that when unused the code is
> also not entered. The struct tcx_entry's program array is currently static,
> but could be made dynamic if necessary at some point in the future. The a/b
> pair swap design has been chosen so that for detachment there are no
> allocations which otherwise could fail. The work has been tested with the
> tc-testing selftest suite, which all passes, as well as with the tc BPF
> tests from the BPF CI, and also with Cilium's L4LB.
>
> Kudos also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
> of this work.
>
> [0] https://lpc.events/event/16/contributions/1353/
> [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@xxxxxxxxxxxxxx/
> [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
> [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
> [4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@xxxxxxxxx/
>
> Signed-off-by: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
> ---
>  MAINTAINERS                    |   4 +-
>  include/linux/netdevice.h      |  15 +-
>  include/linux/skbuff.h         |   4 +-
>  include/net/sch_generic.h      |   2 +-
>  include/net/tcx.h              | 157 +++++++++++++++
>  include/uapi/linux/bpf.h       |  35 +++-
>  kernel/bpf/Kconfig             |   1 +
>  kernel/bpf/Makefile            |   1 +
>  kernel/bpf/syscall.c           |  95 +++++++--
>  kernel/bpf/tcx.c               | 347 +++++++++++++++++++++++++++++++++
>  net/Kconfig                    |   5 +
>  net/core/dev.c                 | 267 +++++++++++++++----------
>  net/core/filter.c              |   4 +-
>  net/sched/Kconfig              |   4 +-
>  net/sched/sch_ingress.c        |  45 ++++-
>  tools/include/uapi/linux/bpf.h |  35 +++-
>  16 files changed, 877 insertions(+), 144 deletions(-)
>  create mode 100644 include/net/tcx.h
>  create mode 100644 kernel/bpf/tcx.c
> [...]
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 207f8a37b327..e7584e24bc83 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1035,6 +1035,8 @@ enum bpf_attach_type {
>  	BPF_TRACE_KPROBE_MULTI,
>  	BPF_LSM_CGROUP,
>  	BPF_STRUCT_OPS,
> +	BPF_TCX_INGRESS,
> +	BPF_TCX_EGRESS,
>  	__MAX_BPF_ATTACH_TYPE
>  };
>
> @@ -1052,7 +1054,7 @@ enum bpf_link_type {
>  	BPF_LINK_TYPE_KPROBE_MULTI = 8,
>  	BPF_LINK_TYPE_STRUCT_OPS = 9,
>  	BPF_LINK_TYPE_NETFILTER = 10,
> -
> +	BPF_LINK_TYPE_TCX = 11,
>  	MAX_BPF_LINK_TYPE,
>  };
>
> @@ -1559,13 +1561,13 @@ union bpf_attr {
>  			__u32		map_fd;		/* struct_ops to attach */
>  		};
>  		union {
> -			__u32	target_fd;	/* object to attach to */
> -			__u32	target_ifindex; /* target ifindex */
> +			__u32	target_fd;	/* target object to attach to or ... */
> +			__u32	target_ifindex; /* target ifindex */
>  		};
>  		__u32		attach_type;	/* attach type */
>  		__u32		flags;		/* extra flags */
>  		union {
> -			__u32	target_btf_id;	/* btf_id of target to attach to */
> +			__u32		target_btf_id;	/* btf_id of target to attach to */

nit: should this part be in patch 1?

>  			struct {
>  				__aligned_u64	iter_info;	/* extra bpf_iter_link_info */
>  				__u32		iter_info_len;	/* iter_info length */
> @@ -1599,6 +1601,13 @@ union bpf_attr {
>  				__s32		priority;
>  				__u32		flags;
>  			} netfilter;
> +			struct {
> +				union {
> +					__u32	relative_fd;
> +					__u32	relative_id;
> +				};
> +				__u32	expected_revision;
> +			} tcx;
>  		};
>  	} link_create;
> [...]
> +int tcx_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> +	struct net *net = current->nsproxy->net_ns;
> +	struct bpf_link_primer link_primer;
> +	struct net_device *dev;
> +	struct tcx_link *link;
> +	int fd, err;
> +
> +	dev = dev_get_by_index(net, attr->link_create.target_ifindex);
> +	if (!dev)
> +		return -EINVAL;
> +	link = kzalloc(sizeof(*link), GFP_USER);
> +	if (!link) {
> +		err = -ENOMEM;
> +		goto out_put;
> +	}
> +
> +	bpf_link_init(&link->link, BPF_LINK_TYPE_TCX, &tcx_link_lops, prog);
> +	link->location = attr->link_create.attach_type;
> +	link->flags = attr->link_create.flags & (BPF_F_FIRST | BPF_F_LAST);
> +	link->dev = dev;
> +
> +	err = bpf_link_prime(&link->link, &link_primer);
> +	if (err) {
> +		kfree(link);
> +		goto out_put;
> +	}
> +	rtnl_lock();
> +	err = tcx_link_prog_attach(&link->link, attr->link_create.flags,
> +				   attr->link_create.tcx.relative_fd,
> +				   attr->link_create.tcx.expected_revision);
> +	if (!err)
> +		fd = bpf_link_settle(&link_primer);

Why this early settle? It makes the error handling logic more convoluted.
Maybe leave link->dev as is and let bpf_link_cleanup() handle dev_put(dev)?
Can it be just:

	err = tcx_link_prog_attach(...);
	rtnl_unlock();
	if (err) {
		link->dev = NULL;
		bpf_link_cleanup(&link_primer);
		goto out_put;
	}
	dev_put(dev);
	return bpf_link_settle(&link_primer);

?

> +	rtnl_unlock();
> +	if (err) {
> +		link->dev = NULL;
> +		bpf_link_cleanup(&link_primer);
> +		goto out_put;
> +	}
> +	dev_put(dev);
> +	return fd;
> +out_put:
> +	dev_put(dev);
> +	return err;
> +}

[...]