Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Wed, 5 Oct 2022 17:22:27 -0700

On Tue, Oct 4, 2022 at 4:12 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>
> This work refactors and adds a lightweight extension to the tc BPF ingress
> and egress data path side for allowing BPF programs via an fd-based attach /
> detach API. The main goal behind this work which we also presented at LPC [0]
> this year is to eventually add support for BPF links for tc BPF programs in
> a second step, thus this prep work is required for the latter which allows
> for a model of safe ownership and program detachment. Given the vast rise
> in tc BPF users in cloud native / Kubernetes environments, this becomes
> necessary to avoid hard to debug incidents either through stale leftover
> programs or 3rd party applications stepping on each others toes. Further
> details for BPF link rationale in next patch.
>
> For the current tc framework, there is no change in behavior with this change
> and neither does this change touch on tc core kernel APIs. The gist of this
> patch is that the ingress and egress hook gets a lightweight, qdisc-less
> extension for BPF to attach its tc BPF programs, in other words, a minimal
> tc-layer entry point for BPF. As part of the feedback from LPC, there was
> a suggestion to provide a name for this infrastructure to more easily differ
> between the classic cls_bpf attachment and the fd-based API. As for most,
> the XDP vs tc layer is already the default mental model for the pkt processing
> pipeline. We refactored this with an xtc internal prefix aka 'express traffic
> control' in order to avoid to deviate too far (and 'express' given its more
> lightweight/faster entry point).
>
> For the ingress and egress xtc points, the device holds a cache-friendly array
> with programs. Same as with classic tc, programs are attached with a prio that
> can be specified or auto-allocated through an idr, and the program return code
> determines whether to continue in the pipeline or to terminate processing.
> With TC_ACT_UNSPEC code, the processing continues (as the case today). The goal
> was to have maximum compatibility to existing tc BPF programs, so they don't
> need to be adapted. Compatibility to call into classic tcf_classify() is also
> provided in order to allow successive migration or both to cleanly co-exist
> where needed given its one logical layer. The fd-based API is behind a static
> key, so that when unused the code is also not entered. The struct xtc_entry's
> program array is currently static, but could be made dynamic if necessary at
> a point in future. Desire has also been expressed for future work to adapt
> similar framework for XDP to allow multi-attach from in-kernel side, too.
>
> Tested with tc-testing selftest suite which all passes, as well as the tc BPF
> tests from the BPF CI.
>
>   [0] https://lpc.events/event/16/contributions/1353/
>
> Co-developed-by: Nikolay Aleksandrov <razor@xxxxxxxxxxxxx>
> Signed-off-by: Nikolay Aleksandrov <razor@xxxxxxxxxxxxx>
> Signed-off-by: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
> ---
>  MAINTAINERS                    |   4 +-
>  include/linux/bpf.h            |   1 +
>  include/linux/netdevice.h      |  14 +-
>  include/linux/skbuff.h         |   4 +-
>  include/net/sch_generic.h      |   2 +-
>  include/net/xtc.h              | 181 ++++++++++++++++++++++
>  include/uapi/linux/bpf.h       |  35 ++++-
>  kernel/bpf/Kconfig             |   1 +
>  kernel/bpf/Makefile            |   1 +
>  kernel/bpf/net.c               | 274 +++++++++++++++++++++++++++++++++
>  kernel/bpf/syscall.c           |  24 ++-
>  net/Kconfig                    |   5 +
>  net/core/dev.c                 | 262 +++++++++++++++++++------------
>  net/core/filter.c              |   4 +-
>  net/sched/Kconfig              |   4 +-
>  net/sched/sch_ingress.c        |  48 +++++-
>  tools/include/uapi/linux/bpf.h |  35 ++++-
>  17 files changed, 769 insertions(+), 130 deletions(-)
>  create mode 100644 include/net/xtc.h
>  create mode 100644 kernel/bpf/net.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index e55a4d47324c..bb63d8d000ea 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3850,13 +3850,15 @@ S:      Maintained
>  F:     kernel/trace/bpf_trace.c
>  F:     kernel/bpf/stackmap.c
>
> -BPF [NETWORKING] (tc BPF, sock_addr)
> +BPF [NETWORKING] (xtc & tc BPF, sock_addr)
>  M:     Martin KaFai Lau <martin.lau@xxxxxxxxx>
>  M:     Daniel Borkmann <daniel@xxxxxxxxxxxxx>
>  R:     John Fastabend <john.fastabend@xxxxxxxxx>
>  L:     bpf@xxxxxxxxxxxxxxx
>  L:     netdev@xxxxxxxxxxxxxxx
>  S:     Maintained
> +F:     include/net/xtc.h
> +F:     kernel/bpf/net.c
>  F:     net/core/filter.c
>  F:     net/sched/act_bpf.c
>  F:     net/sched/cls_bpf.c
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 9e7d46d16032..71e5f43db378 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1473,6 +1473,7 @@ struct bpf_prog_array_item {
>         union {
>                 struct bpf_cgroup_storage *cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE];
>                 u64 bpf_cookie;
> +               u32 bpf_priority;

So this looks unfortunate and unnecessary. You are basically saying no
BPF cookie for this new TC/XTC/TCX thingy. But there is no need, we
already reserve 2 * 8 bytes for cgroup_storage, so make bpf_cookie and
bpf_prio co-exist with

union {
    struct bpf_cgroup_storage *cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE];
    struct {
        u64 bpf_cookie;
        u32 bpf_priority;
    };
}

or is there some problem with that?

>         };
>  };
>

[...]

> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 51b9aa640ad2..de1f5546bcfe 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1025,6 +1025,8 @@ enum bpf_attach_type {
>         BPF_PERF_EVENT,
>         BPF_TRACE_KPROBE_MULTI,
>         BPF_LSM_CGROUP,
> +       BPF_NET_INGRESS,
> +       BPF_NET_EGRESS,

I can bikeshedding as well :) Shouldn't this be TC/TCX-specific attach
types? Wouldn't BPF_[X]TC[X]_INGRESS/BPF_[X]TC[X]_EGRESS be more
appropriate? Because when you think about it XDP is also NET, right,
so I find NET meaning really TC a bit confusing.

>         __MAX_BPF_ATTACH_TYPE
>  };
>
> @@ -1399,14 +1401,20 @@ union bpf_attr {
>         };
>
>         struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
> -               __u32           target_fd;      /* container object to attach to */
> +               union {
> +                       __u32   target_fd;      /* container object to attach to */
> +                       __u32   target_ifindex; /* target ifindex */
> +               };

this makes total sense (target can be FD or ifindex, we have that in
LINK_CREATE as well)

>                 __u32           attach_bpf_fd;  /* eBPF program to attach */
>                 __u32           attach_type;
>                 __u32           attach_flags;
> -               __u32           replace_bpf_fd; /* previously attached eBPF
> +               union {
> +                       __u32   attach_priority;
> +                       __u32   replace_bpf_fd; /* previously attached eBPF
>                                                  * program to replace if
>                                                  * BPF_F_REPLACE is used
>                                                  */
> +               };

But this union seems 1) unnecessary (we don't have to save those 4
bytes), but also 2) wouldn't it make sense to support replace_bpf_fd
with BPF_F_REPLACE (at given prio, if I understand correctly). It's
equivalent situation to what we had in cgroup programs land before we
got bpf_link. So staying consistent makes sense, unless I missed
something?

>         };
>
>         struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
> @@ -1452,7 +1460,10 @@ union bpf_attr {
>         } info;
>
>         struct { /* anonymous struct used by BPF_PROG_QUERY command */
> -               __u32           target_fd;      /* container object to query */
> +               union {
> +                       __u32   target_fd;      /* container object to query */
> +                       __u32   target_ifindex; /* target ifindex */
> +               };
>                 __u32           attach_type;
>                 __u32           query_flags;
>                 __u32           attach_flags;
> @@ -6038,6 +6049,19 @@ struct bpf_sock_tuple {
>         };
>  };
>
> +/* (Simplified) user return codes for tc prog type.
> + * A valid tc program must return one of these defined values. All other
> + * return codes are reserved for future use. Must remain compatible with
> + * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
> + * return codes are mapped to TC_NEXT.
> + */
> +enum tc_action_base {
> +       TC_NEXT         = -1,
> +       TC_PASS         = 0,
> +       TC_DROP         = 2,
> +       TC_REDIRECT     = 7,
> +};
> +
>  struct bpf_xdp_sock {
>         __u32 queue_id;
>  };
> @@ -6804,6 +6828,11 @@ struct bpf_flow_keys {
>         __be32  flow_label;
>  };
>
> +struct bpf_query_info {

this is something that's returned from BPF_PROG_QUERY command, right?
Shouldn't it be called bpf_prog_query_info or something like that?
Just "query_info" is very generic, IMO, but if we are sure that there
will never be any other "QUERY" command, I guess it might be fine.

> +       __u32 prog_id;
> +       __u32 prio;
> +};
> +
>  struct bpf_func_info {
>         __u32   insn_off;
>         __u32   type_id;

[...]