On 06/07, Daniel Borkmann wrote: > This work refactors and adds a lightweight extension ("tcx") to the tc BPF > ingress and egress data path side for allowing BPF program management based > on fds via bpf() syscall through the newly added generic multi-prog API. > The main goal behind this work which we also presented at LPC [0] last year > and a recent update at LSF/MM/BPF this year [3] is to support long-awaited > BPF link functionality for tc BPF programs, which allows for a model of safe > ownership and program detachment. > > Given the rise in tc BPF users in cloud native environments, this becomes > necessary to avoid hard to debug incidents either through stale leftover > programs or 3rd party applications accidentally stepping on each others toes. > As a recap, a BPF link represents the attachment of a BPF program to a BPF > hook point. The BPF link holds a single reference to keep BPF program alive. > Moreover, hook points do not reference a BPF link, only the application's > fd or pinning does. A BPF link holds meta-data specific to attachment and > implements operations for link creation, (atomic) BPF program update, > detachment and introspection. The motivation for BPF links for tc BPF programs > is multi-fold, for example: > > - From Meta: "It's especially important for applications that are deployed > fleet-wide and that don't "control" hosts they are deployed to. If such > application crashes and no one notices and does anything about that, BPF > program will keep running draining resources or even just, say, dropping > packets. We at FB had outages due to such permanent BPF attachment > semantics. With fd-based BPF link we are getting a framework, which allows > safe, auto-detachable behavior by default, unless application explicitly > opts in by pinning the BPF link." [1] > > - From Cilium-side the tc BPF programs we attach to host-facing veth devices > and phys devices build the core datapath for Kubernetes Pods, and they > implement forwarding, load-balancing, policy, EDT-management, etc, within > BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently > experienced hard-to-debug issues in a user's staging environment where > another Kubernetes application using tc BPF attached to the same prio/handle > of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath > it. The goal is to establish a clear/safe ownership model via links which > cannot accidentally be overridden. [0,2] > > BPF links for tc can co-exist with non-link attachments, and the semantics are > in line also with XDP links: BPF links cannot replace other BPF links, BPF > links cannot replace non-BPF links, non-BPF links cannot replace BPF links and > lastly only non-BPF links can replace non-BPF links. In case of Cilium, this > would solve mentioned issue of safe ownership model as 3rd party applications > would not be able to accidentally wipe Cilium programs, even if they are not > BPF link aware. > > Earlier attempts [4] have tried to integrate BPF links into core tc machinery > to solve cls_bpf, which has been intrusive to the generic tc kernel API with > extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could > be wiped from the qdisc also. Locking a tc BPF program in place this way, is > getting into layering hacks given the two object models are vastly different. 
> > We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF > attach API, so that the BPF link implementation blends in naturally similar to > other link types which are fd-based and without the need for changing core tc > internal APIs. BPF programs for tc can then be successively migrated from classic > cls_bpf to the new tc BPF link without needing to change the program's source > code, just the BPF loader mechanics for attaching is sufficient. > > For the current tc framework, there is no change in behavior with this change > and neither does this change touch on tc core kernel APIs. The gist of this > patch is that the ingress and egress hook have a lightweight, qdisc-less > extension for BPF to attach its tc BPF programs, in other words, a minimal > entry point for tc BPF. The name tcx has been suggested from discussion of > earlier revisions of this work as a good fit, and to more easily differ between > the classic cls_bpf attachment and the fd-based one. > > For the ingress and egress tcx points, the device holds a cache-friendly array > with program pointers which is separated from control plane (slow-path) data. > Earlier versions of this work used priority to determine ordering and expression > of dependencies similar as with classic tc, but it was challenged that for > something more future-proof a better user experience is required. Hence this > resulted in the design and development of the generic attach/detach/query API > for multi-progs. See prior patch with its discussion on the API design. tcx is > the first user and later we plan to integrate also others, for example, one > candidate is multi-prog support for XDP which would benefit and have the same > 'look and feel' from API perspective. > > The goal with tcx is to have maximum compatibility to existing tc BPF programs, > so they don't need to be rewritten specifically. Compatibility to call into > classic tcf_classify() is also provided in order to allow successive migration > or both to cleanly co-exist where needed given its all one logical tc layer. > tcx supports the simplified return codes TCX_NEXT which is non-terminating (go > to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT. > The fd-based API is behind a static key, so that when unused the code is also > not entered. The struct tcx_entry's program array is currently static, but > could be made dynamic if necessary at a point in future. The a/b pair swap > design has been chosen so that for detachment there are no allocations which > otherwise could fail. The work has been tested with tc-testing selftest suite > which all passes, as well as the tc BPF tests from the BPF CI, and also with > Cilium's L4LB. > > Kudos also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews > of this work. 
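The compatibility story on the program side reads nicely. Just to double-check I follow the return code mapping: a newly written program would look roughly like the below (minimal sketch on my side, SEC name and loader wiring assumed, not taken from this series), with everything else unchanged vs what people write for cls_bpf today?

  #include <linux/bpf.h>
  #include <linux/if_ether.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_endian.h>

  SEC("tc")
  int tcx_prog(struct __sk_buff *skb)
  {
  	/* Non-terminating: hand off to the next program in the tcx array. */
  	if (skb->protocol != bpf_htons(ETH_P_IP))
  		return TCX_NEXT;
  	/* Terminating codes stay compatible with their TC_ACT_* counterparts. */
  	return TCX_PASS;
  }

  char LICENSE[] SEC("license") = "GPL";
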
> > [0] https://lpc.events/event/16/contributions/1353/ > [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@xxxxxxxxxxxxxx/ > [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog > [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf > [4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@xxxxxxxxx/ > > Signed-off-by: Daniel Borkmann <daniel@xxxxxxxxxxxxx> > --- > MAINTAINERS | 4 +- > include/linux/netdevice.h | 15 +- > include/linux/skbuff.h | 4 +- > include/net/sch_generic.h | 2 +- > include/net/tcx.h | 157 +++++++++++++++ > include/uapi/linux/bpf.h | 35 +++- > kernel/bpf/Kconfig | 1 + > kernel/bpf/Makefile | 1 + > kernel/bpf/syscall.c | 95 +++++++-- > kernel/bpf/tcx.c | 347 +++++++++++++++++++++++++++++++++ > net/Kconfig | 5 + > net/core/dev.c | 267 +++++++++++++++---------- > net/core/filter.c | 4 +- > net/sched/Kconfig | 4 +- > net/sched/sch_ingress.c | 45 ++++- > tools/include/uapi/linux/bpf.h | 35 +++- > 16 files changed, 877 insertions(+), 144 deletions(-) > create mode 100644 include/net/tcx.h > create mode 100644 kernel/bpf/tcx.c > > diff --git a/MAINTAINERS b/MAINTAINERS > index 754a9eeca0a1..7a0d0b0c5a5e 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -3827,13 +3827,15 @@ L: netdev@xxxxxxxxxxxxxxx > S: Maintained > F: kernel/bpf/bpf_struct* > > -BPF [NETWORKING] (tc BPF, sock_addr) > +BPF [NETWORKING] (tcx & tc BPF, sock_addr) > M: Martin KaFai Lau <martin.lau@xxxxxxxxx> > M: Daniel Borkmann <daniel@xxxxxxxxxxxxx> > R: John Fastabend <john.fastabend@xxxxxxxxx> > L: bpf@xxxxxxxxxxxxxxx > L: netdev@xxxxxxxxxxxxxxx > S: Maintained > +F: include/net/tcx.h > +F: kernel/bpf/tcx.c > F: net/core/filter.c > F: net/sched/act_bpf.c > F: net/sched/cls_bpf.c > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > index 08fbd4622ccf..fd4281d1cdbb 100644 > --- a/include/linux/netdevice.h > +++ b/include/linux/netdevice.h > @@ -1927,8 +1927,7 @@ enum netdev_ml_priv_type { > * > * @rx_handler: handler for received packets > * @rx_handler_data: XXX: need comments on this one > - * @miniq_ingress: ingress/clsact qdisc specific data for > - * ingress processing > + * @tcx_ingress: BPF & clsact qdisc specific data for ingress processing > * @ingress_queue: XXX: need comments on this one > * @nf_hooks_ingress: netfilter hooks executed for ingress packets > * @broadcast: hw bcast address > @@ -1949,8 +1948,7 @@ enum netdev_ml_priv_type { > * @xps_maps: all CPUs/RXQs maps for XPS device > * > * @xps_maps: XXX: need comments on this one > - * @miniq_egress: clsact qdisc specific data for > - * egress processing > + * @tcx_egress: BPF & clsact qdisc specific data for egress processing > * @nf_hooks_egress: netfilter hooks executed for egress packets > * @qdisc_hash: qdisc hash table > * @watchdog_timeo: Represents the timeout that is used by > @@ -2249,9 +2247,8 @@ struct net_device { > unsigned int gro_ipv4_max_size; > rx_handler_func_t __rcu *rx_handler; > void __rcu *rx_handler_data; > - > -#ifdef CONFIG_NET_CLS_ACT > - struct mini_Qdisc __rcu *miniq_ingress; > +#ifdef CONFIG_NET_XGRESS > + struct bpf_mprog_entry __rcu *tcx_ingress; > #endif > struct netdev_queue __rcu *ingress_queue; > #ifdef CONFIG_NETFILTER_INGRESS > @@ -2279,8 +2276,8 @@ struct net_device { > #ifdef CONFIG_XPS > struct xps_dev_maps __rcu *xps_maps[XPS_MAPS_MAX]; > #endif > -#ifdef CONFIG_NET_CLS_ACT > - struct mini_Qdisc __rcu *miniq_egress; > +#ifdef 
CONFIG_NET_XGRESS > + struct bpf_mprog_entry __rcu *tcx_egress; > #endif > #ifdef CONFIG_NETFILTER_EGRESS > struct nf_hook_entries __rcu *nf_hooks_egress; > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > index 5951904413ab..48c3e307f057 100644 > --- a/include/linux/skbuff.h > +++ b/include/linux/skbuff.h > @@ -943,7 +943,7 @@ struct sk_buff { > __u8 __mono_tc_offset[0]; > /* public: */ > __u8 mono_delivery_time:1; /* See SKB_MONO_DELIVERY_TIME_MASK */ > -#ifdef CONFIG_NET_CLS_ACT > +#ifdef CONFIG_NET_XGRESS > __u8 tc_at_ingress:1; /* See TC_AT_INGRESS_MASK */ > __u8 tc_skip_classify:1; > #endif > @@ -992,7 +992,7 @@ struct sk_buff { > __u8 csum_not_inet:1; > #endif > > -#ifdef CONFIG_NET_SCHED > +#if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS) > __u16 tc_index; /* traffic control index */ > #endif > > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h > index fab5ba3e61b7..0ade5d1a72b2 100644 > --- a/include/net/sch_generic.h > +++ b/include/net/sch_generic.h > @@ -695,7 +695,7 @@ int skb_do_redirect(struct sk_buff *); > > static inline bool skb_at_tc_ingress(const struct sk_buff *skb) > { > -#ifdef CONFIG_NET_CLS_ACT > +#ifdef CONFIG_NET_XGRESS > return skb->tc_at_ingress; > #else > return false; > diff --git a/include/net/tcx.h b/include/net/tcx.h > new file mode 100644 > index 000000000000..27885ecedff9 > --- /dev/null > +++ b/include/net/tcx.h > @@ -0,0 +1,157 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +/* Copyright (c) 2023 Isovalent */ > +#ifndef __NET_TCX_H > +#define __NET_TCX_H > + > +#include <linux/bpf.h> > +#include <linux/bpf_mprog.h> > + > +#include <net/sch_generic.h> > + > +struct mini_Qdisc; > + > +struct tcx_entry { > + struct bpf_mprog_bundle bundle; > + struct mini_Qdisc __rcu *miniq; > +}; > + > +struct tcx_link { > + struct bpf_link link; > + struct net_device *dev; > + u32 location; > + u32 flags; > +}; > + > +static inline struct tcx_link *tcx_link(struct bpf_link *link) > +{ > + return container_of(link, struct tcx_link, link); > +} > + > +static inline const struct tcx_link *tcx_link_const(const struct bpf_link *link) > +{ > + return tcx_link((struct bpf_link *)link); > +} > + > +static inline void tcx_set_ingress(struct sk_buff *skb, bool ingress) > +{ > +#ifdef CONFIG_NET_XGRESS > + skb->tc_at_ingress = ingress; > +#endif > +} > + > +#ifdef CONFIG_NET_XGRESS > +void tcx_inc(void); > +void tcx_dec(void); > + > +static inline struct tcx_entry *tcx_entry(struct bpf_mprog_entry *entry) > +{ > + return container_of(entry->parent, struct tcx_entry, bundle); > +} > + > +static inline void > +tcx_entry_update(struct net_device *dev, struct bpf_mprog_entry *entry, bool ingress) > +{ > + ASSERT_RTNL(); > + if (ingress) > + rcu_assign_pointer(dev->tcx_ingress, entry); > + else > + rcu_assign_pointer(dev->tcx_egress, entry); > +} > + > +static inline struct bpf_mprog_entry * > +dev_tcx_entry_fetch(struct net_device *dev, bool ingress) > +{ > + ASSERT_RTNL(); > + if (ingress) > + return rcu_dereference_rtnl(dev->tcx_ingress); > + else > + return rcu_dereference_rtnl(dev->tcx_egress); > +} > + > +static inline struct bpf_mprog_entry * [..] > +dev_tcx_entry_fetch_or_create(struct net_device *dev, bool ingress, bool *created) Regarding 'created' argument: any reason we are not doing conventional reference counting on bpf_mprog_entry? I wonder if there is a better way to hide those places where we handle BPF_MPROG_FREE explicitly. Btw, thinking of this a/b arrays, should we call them active/inactive? 
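Something like the (completely untested) sketch below is what I had in mind, helper names made up, plus a refcount_set(.., 1) at create time; then the attach/detach and qdisc paths would only ever do get/put and the final put frees, instead of open-coding the BPF_MPROG_FREE handling:

  struct tcx_entry {
  	struct bpf_mprog_bundle bundle;
  	struct mini_Qdisc __rcu *miniq;
  	refcount_t refcnt;
  };

  static inline struct bpf_mprog_entry *
  tcx_entry_get(struct bpf_mprog_entry *entry)
  {
  	refcount_inc(&tcx_entry(entry)->refcnt);
  	return entry;
  }

  static inline void tcx_entry_put(struct bpf_mprog_entry *entry)
  {
  	/* Last reference gone: no BPF progs and no miniq user left. */
  	if (refcount_dec_and_test(&tcx_entry(entry)->refcnt))
  		bpf_mprog_free(entry);
  }
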
> +{ > + struct bpf_mprog_entry *entry = dev_tcx_entry_fetch(dev, ingress); > + > + *created = false; > + if (!entry) { > + entry = bpf_mprog_create(sizeof_field(struct tcx_entry, > + miniq)); > + if (!entry) > + return NULL; > + *created = true; > + } > + return entry; > +} > + > +static inline void tcx_skeys_inc(bool ingress) > +{ > + tcx_inc(); > + if (ingress) > + net_inc_ingress_queue(); > + else > + net_inc_egress_queue(); > +} > + > +static inline void tcx_skeys_dec(bool ingress) > +{ > + if (ingress) > + net_dec_ingress_queue(); > + else > + net_dec_egress_queue(); > + tcx_dec(); > +} > + > +static inline enum tcx_action_base tcx_action_code(struct sk_buff *skb, int code) > +{ > + switch (code) { > + case TCX_PASS: > + skb->tc_index = qdisc_skb_cb(skb)->tc_classid; > + fallthrough; > + case TCX_DROP: > + case TCX_REDIRECT: > + return code; > + case TCX_NEXT: > + default: > + return TCX_NEXT; > + } > +} > +#endif /* CONFIG_NET_XGRESS */ > + > +#if defined(CONFIG_NET_XGRESS) && defined(CONFIG_BPF_SYSCALL) > +int tcx_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog); > +int tcx_link_attach(const union bpf_attr *attr, struct bpf_prog *prog); > +int tcx_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog); > +int tcx_prog_query(const union bpf_attr *attr, > + union bpf_attr __user *uattr); > +void dev_tcx_uninstall(struct net_device *dev); > +#else > +static inline int tcx_prog_attach(const union bpf_attr *attr, > + struct bpf_prog *prog) > +{ > + return -EINVAL; > +} > + > +static inline int tcx_link_attach(const union bpf_attr *attr, > + struct bpf_prog *prog) > +{ > + return -EINVAL; > +} > + > +static inline int tcx_prog_detach(const union bpf_attr *attr, > + struct bpf_prog *prog) > +{ > + return -EINVAL; > +} > + > +static inline int tcx_prog_query(const union bpf_attr *attr, > + union bpf_attr __user *uattr) > +{ > + return -EINVAL; > +} > + > +static inline void dev_tcx_uninstall(struct net_device *dev) > +{ > +} > +#endif /* CONFIG_NET_XGRESS && CONFIG_BPF_SYSCALL */ > +#endif /* __NET_TCX_H */ > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 207f8a37b327..e7584e24bc83 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -1035,6 +1035,8 @@ enum bpf_attach_type { > BPF_TRACE_KPROBE_MULTI, > BPF_LSM_CGROUP, > BPF_STRUCT_OPS, > + BPF_TCX_INGRESS, > + BPF_TCX_EGRESS, > __MAX_BPF_ATTACH_TYPE > }; > > @@ -1052,7 +1054,7 @@ enum bpf_link_type { > BPF_LINK_TYPE_KPROBE_MULTI = 8, > BPF_LINK_TYPE_STRUCT_OPS = 9, > BPF_LINK_TYPE_NETFILTER = 10, > - > + BPF_LINK_TYPE_TCX = 11, > MAX_BPF_LINK_TYPE, > }; > > @@ -1559,13 +1561,13 @@ union bpf_attr { > __u32 map_fd; /* struct_ops to attach */ > }; > union { > - __u32 target_fd; /* object to attach to */ > - __u32 target_ifindex; /* target ifindex */ > + __u32 target_fd; /* target object to attach to or ... 
*/ > + __u32 target_ifindex; /* target ifindex */ > }; > __u32 attach_type; /* attach type */ > __u32 flags; /* extra flags */ > union { > - __u32 target_btf_id; /* btf_id of target to attach to */ > + __u32 target_btf_id; /* btf_id of target to attach to */ > struct { > __aligned_u64 iter_info; /* extra bpf_iter_link_info */ > __u32 iter_info_len; /* iter_info length */ > @@ -1599,6 +1601,13 @@ union bpf_attr { > __s32 priority; > __u32 flags; > } netfilter; > + struct { > + union { > + __u32 relative_fd; > + __u32 relative_id; > + }; > + __u32 expected_revision; > + } tcx; > }; > } link_create; > > @@ -6207,6 +6216,19 @@ struct bpf_sock_tuple { > }; > }; > > +/* (Simplified) user return codes for tcx prog type. > + * A valid tcx program must return one of these defined values. All other > + * return codes are reserved for future use. Must remain compatible with > + * their TC_ACT_* counter-parts. For compatibility in behavior, unknown > + * return codes are mapped to TCX_NEXT. > + */ > +enum tcx_action_base { > + TCX_NEXT = -1, > + TCX_PASS = 0, > + TCX_DROP = 2, > + TCX_REDIRECT = 7, > +}; > + > struct bpf_xdp_sock { > __u32 queue_id; > }; > @@ -6459,6 +6481,11 @@ struct bpf_link_info { > __s32 priority; > __u32 flags; > } netfilter; > + struct { > + __u32 ifindex; > + __u32 attach_type; > + __u32 flags; > + } tcx; > }; > } __attribute__((aligned(8))); > > diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig > index 2dfe1079f772..6a906ff93006 100644 > --- a/kernel/bpf/Kconfig > +++ b/kernel/bpf/Kconfig > @@ -31,6 +31,7 @@ config BPF_SYSCALL > select TASKS_TRACE_RCU > select BINARY_PRINTF > select NET_SOCK_MSG if NET > + select NET_XGRESS if NET > select PAGE_POOL if NET > default n > help > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile > index 1bea2eb912cd..f526b7573e97 100644 > --- a/kernel/bpf/Makefile > +++ b/kernel/bpf/Makefile > @@ -21,6 +21,7 @@ obj-$(CONFIG_BPF_SYSCALL) += devmap.o > obj-$(CONFIG_BPF_SYSCALL) += cpumap.o > obj-$(CONFIG_BPF_SYSCALL) += offload.o > obj-$(CONFIG_BPF_SYSCALL) += net_namespace.o > +obj-$(CONFIG_BPF_SYSCALL) += tcx.o > endif > ifeq ($(CONFIG_PERF_EVENTS),y) > obj-$(CONFIG_BPF_SYSCALL) += stackmap.o > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > index 92a57efc77de..e2c219d053f4 100644 > --- a/kernel/bpf/syscall.c > +++ b/kernel/bpf/syscall.c > @@ -37,6 +37,8 @@ > #include <linux/trace_events.h> > #include <net/netfilter/nf_bpf_link.h> > > +#include <net/tcx.h> > + > #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \ > (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \ > (map)->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS) > @@ -3522,31 +3524,57 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type) > return BPF_PROG_TYPE_XDP; > case BPF_LSM_CGROUP: > return BPF_PROG_TYPE_LSM; > + case BPF_TCX_INGRESS: > + case BPF_TCX_EGRESS: > + return BPF_PROG_TYPE_SCHED_CLS; > default: > return BPF_PROG_TYPE_UNSPEC; > } > } > > -#define BPF_PROG_ATTACH_LAST_FIELD replace_bpf_fd > +#define BPF_PROG_ATTACH_LAST_FIELD expected_revision > + > +#define BPF_F_ATTACH_MASK_BASE \ > + (BPF_F_ALLOW_OVERRIDE | \ > + BPF_F_ALLOW_MULTI | \ > + BPF_F_REPLACE) > + > +#define BPF_F_ATTACH_MASK_MPROG \ > + (BPF_F_REPLACE | \ > + BPF_F_BEFORE | \ > + BPF_F_AFTER | \ > + BPF_F_FIRST | \ > + BPF_F_LAST | \ > + BPF_F_ID | \ > + BPF_F_LINK) > > -#define BPF_F_ATTACH_MASK \ > - (BPF_F_ALLOW_OVERRIDE | BPF_F_ALLOW_MULTI | BPF_F_REPLACE) > +static bool bpf_supports_mprog(enum bpf_prog_type ptype) > +{ > + switch (ptype) { > + case 
BPF_PROG_TYPE_SCHED_CLS: > + return true; > + default: > + return false; > + } > +} > > static int bpf_prog_attach(const union bpf_attr *attr) > { > enum bpf_prog_type ptype; > struct bpf_prog *prog; > + u32 mask; > int ret; > > if (CHECK_ATTR(BPF_PROG_ATTACH)) > return -EINVAL; > > - if (attr->attach_flags & ~BPF_F_ATTACH_MASK) > - return -EINVAL; > - > ptype = attach_type_to_prog_type(attr->attach_type); > if (ptype == BPF_PROG_TYPE_UNSPEC) > return -EINVAL; > + mask = bpf_supports_mprog(ptype) ? > + BPF_F_ATTACH_MASK_MPROG : BPF_F_ATTACH_MASK_BASE; > + if (attr->attach_flags & ~mask) > + return -EINVAL; > > prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype); > if (IS_ERR(prog)) > @@ -3582,6 +3610,9 @@ static int bpf_prog_attach(const union bpf_attr *attr) > else > ret = cgroup_bpf_prog_attach(attr, ptype, prog); > break; > + case BPF_PROG_TYPE_SCHED_CLS: > + ret = tcx_prog_attach(attr, prog); > + break; > default: > ret = -EINVAL; > } > @@ -3591,25 +3622,42 @@ static int bpf_prog_attach(const union bpf_attr *attr) > return ret; > } > > -#define BPF_PROG_DETACH_LAST_FIELD attach_type > +#define BPF_PROG_DETACH_LAST_FIELD expected_revision > > static int bpf_prog_detach(const union bpf_attr *attr) > { > + struct bpf_prog *prog = NULL; > enum bpf_prog_type ptype; > + int ret; > > if (CHECK_ATTR(BPF_PROG_DETACH)) > return -EINVAL; > > ptype = attach_type_to_prog_type(attr->attach_type); > + if (bpf_supports_mprog(ptype)) { > + if (ptype == BPF_PROG_TYPE_UNSPEC) > + return -EINVAL; > + if (attr->attach_flags & ~BPF_F_ATTACH_MASK_MPROG) > + return -EINVAL; > + prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype); > + if (IS_ERR(prog)) { > + if ((int)attr->attach_bpf_fd > 0) > + return PTR_ERR(prog); > + prog = NULL; > + } > + } > > switch (ptype) { > case BPF_PROG_TYPE_SK_MSG: > case BPF_PROG_TYPE_SK_SKB: > - return sock_map_prog_detach(attr, ptype); > + ret = sock_map_prog_detach(attr, ptype); > + break; > case BPF_PROG_TYPE_LIRC_MODE2: > - return lirc_prog_detach(attr); > + ret = lirc_prog_detach(attr); > + break; > case BPF_PROG_TYPE_FLOW_DISSECTOR: > - return netns_bpf_prog_detach(attr, ptype); > + ret = netns_bpf_prog_detach(attr, ptype); > + break; > case BPF_PROG_TYPE_CGROUP_DEVICE: > case BPF_PROG_TYPE_CGROUP_SKB: > case BPF_PROG_TYPE_CGROUP_SOCK: > @@ -3618,13 +3666,21 @@ static int bpf_prog_detach(const union bpf_attr *attr) > case BPF_PROG_TYPE_CGROUP_SYSCTL: > case BPF_PROG_TYPE_SOCK_OPS: > case BPF_PROG_TYPE_LSM: > - return cgroup_bpf_prog_detach(attr, ptype); > + ret = cgroup_bpf_prog_detach(attr, ptype); > + break; > + case BPF_PROG_TYPE_SCHED_CLS: > + ret = tcx_prog_detach(attr, prog); > + break; > default: > - return -EINVAL; > + ret = -EINVAL; > } > + > + if (prog) > + bpf_prog_put(prog); > + return ret; > } > > -#define BPF_PROG_QUERY_LAST_FIELD query.prog_attach_flags > +#define BPF_PROG_QUERY_LAST_FIELD query.link_attach_flags > > static int bpf_prog_query(const union bpf_attr *attr, > union bpf_attr __user *uattr) > @@ -3672,6 +3728,9 @@ static int bpf_prog_query(const union bpf_attr *attr, > case BPF_SK_MSG_VERDICT: > case BPF_SK_SKB_VERDICT: > return sock_map_bpf_prog_query(attr, uattr); > + case BPF_TCX_INGRESS: > + case BPF_TCX_EGRESS: > + return tcx_prog_query(attr, uattr); > default: > return -EINVAL; > } > @@ -4629,6 +4688,13 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr) > goto out; > } > break; > + case BPF_PROG_TYPE_SCHED_CLS: > + if (attr->link_create.attach_type != BPF_TCX_INGRESS && > + attr->link_create.attach_type != BPF_TCX_EGRESS) { > 
+ ret = -EINVAL; > + goto out; > + } > + break; > default: > ptype = attach_type_to_prog_type(attr->link_create.attach_type); > if (ptype == BPF_PROG_TYPE_UNSPEC || ptype != prog->type) { > @@ -4680,6 +4746,9 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr) > case BPF_PROG_TYPE_XDP: > ret = bpf_xdp_link_attach(attr, prog); > break; > + case BPF_PROG_TYPE_SCHED_CLS: > + ret = tcx_link_attach(attr, prog); > + break; > case BPF_PROG_TYPE_NETFILTER: > ret = bpf_nf_link_attach(attr, prog); > break; > diff --git a/kernel/bpf/tcx.c b/kernel/bpf/tcx.c > new file mode 100644 > index 000000000000..d3d23b4ed4f0 > --- /dev/null > +++ b/kernel/bpf/tcx.c > @@ -0,0 +1,347 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* Copyright (c) 2023 Isovalent */ > + > +#include <linux/bpf.h> > +#include <linux/bpf_mprog.h> > +#include <linux/netdevice.h> > + > +#include <net/tcx.h> > + > +int tcx_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog) > +{ > + bool created, ingress = attr->attach_type == BPF_TCX_INGRESS; > + struct net *net = current->nsproxy->net_ns; > + struct bpf_mprog_entry *entry; > + struct net_device *dev; > + int ret; > + > + rtnl_lock(); > + dev = __dev_get_by_index(net, attr->target_ifindex); > + if (!dev) { > + ret = -ENODEV; > + goto out; > + } > + entry = dev_tcx_entry_fetch_or_create(dev, ingress, &created); > + if (!entry) { > + ret = -ENOMEM; > + goto out; > + } > + ret = bpf_mprog_attach(entry, prog, NULL, attr->attach_flags, > + attr->relative_fd, attr->expected_revision); > + if (ret >= 0) { > + if (ret == BPF_MPROG_SWAP) > + tcx_entry_update(dev, bpf_mprog_peer(entry), ingress); > + bpf_mprog_commit(entry); > + tcx_skeys_inc(ingress); > + ret = 0; > + } else if (created) { > + bpf_mprog_free(entry); > + } > +out: > + rtnl_unlock(); > + return ret; > +} > + > +static bool tcx_release_entry(struct bpf_mprog_entry *entry, int code) > +{ > + return code == BPF_MPROG_FREE && !tcx_entry(entry)->miniq; > +} > + > +int tcx_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog) > +{ > + bool tcx_release, ingress = attr->attach_type == BPF_TCX_INGRESS; > + struct net *net = current->nsproxy->net_ns; > + struct bpf_mprog_entry *entry, *peer; > + struct net_device *dev; > + int ret; > + > + rtnl_lock(); > + dev = __dev_get_by_index(net, attr->target_ifindex); > + if (!dev) { > + ret = -ENODEV; > + goto out; > + } > + entry = dev_tcx_entry_fetch(dev, ingress); > + if (!entry) { > + ret = -ENOENT; > + goto out; > + } > + ret = bpf_mprog_detach(entry, prog, NULL, attr->attach_flags, > + attr->relative_fd, attr->expected_revision); > + if (ret >= 0) { > + tcx_release = tcx_release_entry(entry, ret); > + peer = tcx_release ? 
NULL : bpf_mprog_peer(entry); > + if (ret == BPF_MPROG_SWAP || ret == BPF_MPROG_FREE) > + tcx_entry_update(dev, peer, ingress); > + bpf_mprog_commit(entry); > + tcx_skeys_dec(ingress); > + if (tcx_release) > + bpf_mprog_free(entry); > + ret = 0; > + } > +out: > + rtnl_unlock(); > + return ret; > +} > + > +static void tcx_uninstall(struct net_device *dev, bool ingress) > +{ > + struct bpf_tuple tuple = {}; > + struct bpf_mprog_entry *entry; > + struct bpf_mprog_fp *fp; > + struct bpf_mprog_cp *cp; > + > + entry = dev_tcx_entry_fetch(dev, ingress); > + if (!entry) > + return; > + tcx_entry_update(dev, NULL, ingress); > + bpf_mprog_commit(entry); > + bpf_mprog_foreach_tuple(entry, fp, cp, tuple) { > + if (tuple.link) > + tcx_link(tuple.link)->dev = NULL; > + else > + bpf_prog_put(tuple.prog); > + tcx_skeys_dec(ingress); > + } > + WARN_ON_ONCE(tcx_entry(entry)->miniq); > + bpf_mprog_free(entry); > +} > + > +void dev_tcx_uninstall(struct net_device *dev) > +{ > + ASSERT_RTNL(); > + tcx_uninstall(dev, true); > + tcx_uninstall(dev, false); > +} > + > +int tcx_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr) > +{ > + bool ingress = attr->query.attach_type == BPF_TCX_INGRESS; > + struct net *net = current->nsproxy->net_ns; > + struct bpf_mprog_entry *entry; > + struct net_device *dev; > + int ret; > + > + rtnl_lock(); > + dev = __dev_get_by_index(net, attr->query.target_ifindex); > + if (!dev) { > + ret = -ENODEV; > + goto out; > + } > + entry = dev_tcx_entry_fetch(dev, ingress); > + if (!entry) { > + ret = -ENOENT; > + goto out; > + } > + ret = bpf_mprog_query(attr, uattr, entry); > +out: > + rtnl_unlock(); > + return ret; > +} > + > +static int tcx_link_prog_attach(struct bpf_link *l, u32 flags, u32 object, > + u32 expected_revision) > +{ > + struct tcx_link *link = tcx_link(l); > + bool created, ingress = link->location == BPF_TCX_INGRESS; > + struct net_device *dev = link->dev; > + struct bpf_mprog_entry *entry; > + int ret; > + > + ASSERT_RTNL(); > + entry = dev_tcx_entry_fetch_or_create(dev, ingress, &created); > + if (!entry) > + return -ENOMEM; > + ret = bpf_mprog_attach(entry, l->prog, l, flags, object, > + expected_revision); > + if (ret >= 0) { > + if (ret == BPF_MPROG_SWAP) > + tcx_entry_update(dev, bpf_mprog_peer(entry), ingress); > + bpf_mprog_commit(entry); > + tcx_skeys_inc(ingress); > + ret = 0; > + } else if (created) { > + bpf_mprog_free(entry); > + } > + return ret; > +} > + > +static void tcx_link_release(struct bpf_link *l) > +{ > + struct tcx_link *link = tcx_link(l); > + bool tcx_release, ingress = link->location == BPF_TCX_INGRESS; > + struct bpf_mprog_entry *entry, *peer; > + struct net_device *dev; > + int ret = 0; > + > + rtnl_lock(); > + dev = link->dev; > + if (!dev) > + goto out; > + entry = dev_tcx_entry_fetch(dev, ingress); > + if (!entry) { > + ret = -ENOENT; > + goto out; > + } > + ret = bpf_mprog_detach(entry, l->prog, l, link->flags, 0, 0); > + if (ret >= 0) { > + tcx_release = tcx_release_entry(entry, ret); > + peer = tcx_release ? 
NULL : bpf_mprog_peer(entry); > + if (ret == BPF_MPROG_SWAP || ret == BPF_MPROG_FREE) > + tcx_entry_update(dev, peer, ingress); > + bpf_mprog_commit(entry); > + tcx_skeys_dec(ingress); > + if (tcx_release) > + bpf_mprog_free(entry); > + link->dev = NULL; > + ret = 0; > + } > +out: > + WARN_ON_ONCE(ret); > + rtnl_unlock(); > +} > + > +static int tcx_link_update(struct bpf_link *l, struct bpf_prog *nprog, > + struct bpf_prog *oprog) > +{ > + struct tcx_link *link = tcx_link(l); > + bool ingress = link->location == BPF_TCX_INGRESS; > + struct net_device *dev = link->dev; > + struct bpf_mprog_entry *entry; > + int ret = 0; > + > + rtnl_lock(); > + if (!link->dev) { > + ret = -ENOLINK; > + goto out; > + } > + if (oprog && l->prog != oprog) { > + ret = -EPERM; > + goto out; > + } > + oprog = l->prog; > + if (oprog == nprog) { > + bpf_prog_put(nprog); > + goto out; > + } > + entry = dev_tcx_entry_fetch(dev, ingress); > + if (!entry) { > + ret = -ENOENT; > + goto out; > + } > + ret = bpf_mprog_attach(entry, nprog, l, > + BPF_F_REPLACE | BPF_F_ID | link->flags, > + l->prog->aux->id, 0); > + if (ret >= 0) { > + if (ret == BPF_MPROG_SWAP) > + tcx_entry_update(dev, bpf_mprog_peer(entry), ingress); > + bpf_mprog_commit(entry); > + tcx_skeys_inc(ingress); > + oprog = xchg(&l->prog, nprog); > + bpf_prog_put(oprog); > + ret = 0; > + } > +out: > + rtnl_unlock(); > + return ret; > +} > + > +static void tcx_link_dealloc(struct bpf_link *l) > +{ > + kfree(tcx_link(l)); > +} > + > +static void tcx_link_fdinfo(const struct bpf_link *l, struct seq_file *seq) > +{ > + const struct tcx_link *link = tcx_link_const(l); > + u32 ifindex = 0; > + > + rtnl_lock(); > + if (link->dev) > + ifindex = link->dev->ifindex; > + rtnl_unlock(); > + > + seq_printf(seq, "ifindex:\t%u\n", ifindex); > + seq_printf(seq, "attach_type:\t%u (%s)\n", > + link->location, > + link->location == BPF_TCX_INGRESS ? 
"ingress" : "egress"); > + seq_printf(seq, "flags:\t%u\n", link->flags); > +} > + > +static int tcx_link_fill_info(const struct bpf_link *l, > + struct bpf_link_info *info) > +{ > + const struct tcx_link *link = tcx_link_const(l); > + u32 ifindex = 0; > + > + rtnl_lock(); > + if (link->dev) > + ifindex = link->dev->ifindex; > + rtnl_unlock(); > + > + info->tcx.ifindex = ifindex; > + info->tcx.attach_type = link->location; > + info->tcx.flags = link->flags; > + return 0; > +} > + > +static int tcx_link_detach(struct bpf_link *l) > +{ > + tcx_link_release(l); > + return 0; > +} > + > +static const struct bpf_link_ops tcx_link_lops = { > + .release = tcx_link_release, > + .detach = tcx_link_detach, > + .dealloc = tcx_link_dealloc, > + .update_prog = tcx_link_update, > + .show_fdinfo = tcx_link_fdinfo, > + .fill_link_info = tcx_link_fill_info, > +}; > + > +int tcx_link_attach(const union bpf_attr *attr, struct bpf_prog *prog) > +{ > + struct net *net = current->nsproxy->net_ns; > + struct bpf_link_primer link_primer; > + struct net_device *dev; > + struct tcx_link *link; > + int fd, err; > + > + dev = dev_get_by_index(net, attr->link_create.target_ifindex); > + if (!dev) > + return -EINVAL; > + link = kzalloc(sizeof(*link), GFP_USER); > + if (!link) { > + err = -ENOMEM; > + goto out_put; > + } > + > + bpf_link_init(&link->link, BPF_LINK_TYPE_TCX, &tcx_link_lops, prog); > + link->location = attr->link_create.attach_type; > + link->flags = attr->link_create.flags & (BPF_F_FIRST | BPF_F_LAST); > + link->dev = dev; > + > + err = bpf_link_prime(&link->link, &link_primer); > + if (err) { > + kfree(link); > + goto out_put; > + } > + rtnl_lock(); > + err = tcx_link_prog_attach(&link->link, attr->link_create.flags, > + attr->link_create.tcx.relative_fd, > + attr->link_create.tcx.expected_revision); > + if (!err) > + fd = bpf_link_settle(&link_primer); > + rtnl_unlock(); > + if (err) { > + link->dev = NULL; > + bpf_link_cleanup(&link_primer); > + goto out_put; > + } > + dev_put(dev); > + return fd; > +out_put: > + dev_put(dev); > + return err; > +} > diff --git a/net/Kconfig b/net/Kconfig > index 2fb25b534df5..d532ec33f1fe 100644 > --- a/net/Kconfig > +++ b/net/Kconfig > @@ -52,6 +52,11 @@ config NET_INGRESS > config NET_EGRESS > bool > > +config NET_XGRESS > + select NET_INGRESS > + select NET_EGRESS > + bool > + > config NET_REDIRECT > bool > > diff --git a/net/core/dev.c b/net/core/dev.c > index 3393c2f3dbe8..95c7e3189884 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -107,6 +107,7 @@ > #include <net/pkt_cls.h> > #include <net/checksum.h> > #include <net/xfrm.h> > +#include <net/tcx.h> > #include <linux/highmem.h> > #include <linux/init.h> > #include <linux/module.h> > @@ -154,7 +155,6 @@ > #include "dev.h" > #include "net-sysfs.h" > > - > static DEFINE_SPINLOCK(ptype_lock); > struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly; > struct list_head ptype_all __read_mostly; /* Taps */ > @@ -3923,69 +3923,200 @@ int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *skb) > EXPORT_SYMBOL(dev_loopback_xmit); > > #ifdef CONFIG_NET_EGRESS > -static struct sk_buff * > -sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev) > +static struct netdev_queue * > +netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb) > +{ > + int qm = skb_get_queue_mapping(skb); > + > + return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm)); > +} > + > +static bool netdev_xmit_txqueue_skipped(void) > { > + return 
__this_cpu_read(softnet_data.xmit.skip_txqueue); > +} > + > +void netdev_xmit_skip_txqueue(bool skip) > +{ > + __this_cpu_write(softnet_data.xmit.skip_txqueue, skip); > +} > +EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue); > +#endif /* CONFIG_NET_EGRESS */ > + > +#ifdef CONFIG_NET_XGRESS > +static int tc_run(struct tcx_entry *entry, struct sk_buff *skb) > +{ > + int ret = TC_ACT_UNSPEC; > #ifdef CONFIG_NET_CLS_ACT > - struct mini_Qdisc *miniq = rcu_dereference_bh(dev->miniq_egress); > - struct tcf_result cl_res; > + struct mini_Qdisc *miniq = rcu_dereference_bh(entry->miniq); > + struct tcf_result res; > > if (!miniq) > - return skb; > + return ret; > > - /* qdisc_skb_cb(skb)->pkt_len was already set by the caller. */ > tc_skb_cb(skb)->mru = 0; > tc_skb_cb(skb)->post_ct = false; > - mini_qdisc_bstats_cpu_update(miniq, skb); > > - switch (tcf_classify(skb, miniq->block, miniq->filter_list, &cl_res, false)) { > + mini_qdisc_bstats_cpu_update(miniq, skb); > + ret = tcf_classify(skb, miniq->block, miniq->filter_list, &res, false); > + /* Only tcf related quirks below. */ > + switch (ret) { > + case TC_ACT_SHOT: > + mini_qdisc_qstats_cpu_drop(miniq); > + break; > case TC_ACT_OK: > case TC_ACT_RECLASSIFY: > - skb->tc_index = TC_H_MIN(cl_res.classid); > + skb->tc_index = TC_H_MIN(res.classid); > break; > + } > +#endif /* CONFIG_NET_CLS_ACT */ > + return ret; > +} > + > +static DEFINE_STATIC_KEY_FALSE(tcx_needed_key); > + > +void tcx_inc(void) > +{ > + static_branch_inc(&tcx_needed_key); > +} > +EXPORT_SYMBOL_GPL(tcx_inc); > + > +void tcx_dec(void) > +{ > + static_branch_dec(&tcx_needed_key); > +} > +EXPORT_SYMBOL_GPL(tcx_dec); > + > +static __always_inline enum tcx_action_base > +tcx_run(const struct bpf_mprog_entry *entry, struct sk_buff *skb, > + const bool needs_mac) > +{ > + const struct bpf_mprog_fp *fp; > + const struct bpf_prog *prog; > + int ret = TCX_NEXT; > + > + if (needs_mac) > + __skb_push(skb, skb->mac_len); > + bpf_mprog_foreach_prog(entry, fp, prog) { > + bpf_compute_data_pointers(skb); > + ret = bpf_prog_run(prog, skb); > + if (ret != TCX_NEXT) > + break; > + } > + if (needs_mac) > + __skb_pull(skb, skb->mac_len); > + return tcx_action_code(skb, ret); > +} > + > +static __always_inline struct sk_buff * > +sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret, > + struct net_device *orig_dev, bool *another) > +{ > + struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress); > + int sch_ret; > + > + if (!entry) > + return skb; > + if (*pt_prev) { > + *ret = deliver_skb(skb, *pt_prev, orig_dev); > + *pt_prev = NULL; > + } > + > + qdisc_skb_cb(skb)->pkt_len = skb->len; > + tcx_set_ingress(skb, true); > + > + if (static_branch_unlikely(&tcx_needed_key)) { > + sch_ret = tcx_run(entry, skb, true); > + if (sch_ret != TC_ACT_UNSPEC) > + goto ingress_verdict; > + } > + sch_ret = tc_run(container_of(entry->parent, struct tcx_entry, bundle), skb); > +ingress_verdict: > + switch (sch_ret) { > + case TC_ACT_REDIRECT: > + /* skb_mac_header check was done by BPF, so we can safely > + * push the L2 header back before redirecting to another > + * netdev. 
> + */ > + __skb_push(skb, skb->mac_len); > + if (skb_do_redirect(skb) == -EAGAIN) { > + __skb_pull(skb, skb->mac_len); > + *another = true; > + break; > + } > + *ret = NET_RX_SUCCESS; > + return NULL; > case TC_ACT_SHOT: > - mini_qdisc_qstats_cpu_drop(miniq); > - *ret = NET_XMIT_DROP; > - kfree_skb_reason(skb, SKB_DROP_REASON_TC_EGRESS); > + kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS); > + *ret = NET_RX_DROP; > return NULL; > + /* used by tc_run */ > case TC_ACT_STOLEN: > case TC_ACT_QUEUED: > case TC_ACT_TRAP: > - *ret = NET_XMIT_SUCCESS; > consume_skb(skb); > + fallthrough; > + case TC_ACT_CONSUMED: > + *ret = NET_RX_SUCCESS; > return NULL; > + } > + > + return skb; > +} > + > +static __always_inline struct sk_buff * > +sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev) > +{ > + struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress); > + int sch_ret; > + > + if (!entry) > + return skb; > + > + /* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was > + * already set by the caller. > + */ > + if (static_branch_unlikely(&tcx_needed_key)) { > + sch_ret = tcx_run(entry, skb, false); > + if (sch_ret != TC_ACT_UNSPEC) > + goto egress_verdict; > + } > + sch_ret = tc_run(container_of(entry->parent, struct tcx_entry, bundle), skb); > +egress_verdict: > + switch (sch_ret) { > case TC_ACT_REDIRECT: > /* No need to push/pop skb's mac_header here on egress! */ > skb_do_redirect(skb); > *ret = NET_XMIT_SUCCESS; > return NULL; > - default: > - break; > + case TC_ACT_SHOT: > + kfree_skb_reason(skb, SKB_DROP_REASON_TC_EGRESS); > + *ret = NET_XMIT_DROP; > + return NULL; > + /* used by tc_run */ > + case TC_ACT_STOLEN: > + case TC_ACT_QUEUED: > + case TC_ACT_TRAP: > + *ret = NET_XMIT_SUCCESS; > + return NULL; > } > -#endif /* CONFIG_NET_CLS_ACT */ > > return skb; > } > - > -static struct netdev_queue * > -netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb) > -{ > - int qm = skb_get_queue_mapping(skb); > - > - return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm)); > -} > - > -static bool netdev_xmit_txqueue_skipped(void) > +#else > +static __always_inline struct sk_buff * > +sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret, > + struct net_device *orig_dev, bool *another) > { > - return __this_cpu_read(softnet_data.xmit.skip_txqueue); > + return skb; > } > > -void netdev_xmit_skip_txqueue(bool skip) > +static __always_inline struct sk_buff * > +sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev) > { > - __this_cpu_write(softnet_data.xmit.skip_txqueue, skip); > + return skb; > } > -EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue); > -#endif /* CONFIG_NET_EGRESS */ > +#endif /* CONFIG_NET_XGRESS */ > > #ifdef CONFIG_XPS > static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb, > @@ -4169,9 +4300,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev) > skb_update_prio(skb); > > qdisc_pkt_len_init(skb); > -#ifdef CONFIG_NET_CLS_ACT > - skb->tc_at_ingress = 0; > -#endif > + tcx_set_ingress(skb, false); > #ifdef CONFIG_NET_EGRESS > if (static_branch_unlikely(&egress_needed_key)) { > if (nf_hook_egress_active()) { > @@ -5103,72 +5232,6 @@ int (*br_fdb_test_addr_hook)(struct net_device *dev, > EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook); > #endif > > -static inline struct sk_buff * > -sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret, > - struct net_device *orig_dev, bool *another) > -{ > -#ifdef CONFIG_NET_CLS_ACT > - struct mini_Qdisc *miniq 
= rcu_dereference_bh(skb->dev->miniq_ingress); > - struct tcf_result cl_res; > - > - /* If there's at least one ingress present somewhere (so > - * we get here via enabled static key), remaining devices > - * that are not configured with an ingress qdisc will bail > - * out here. > - */ > - if (!miniq) > - return skb; > - > - if (*pt_prev) { > - *ret = deliver_skb(skb, *pt_prev, orig_dev); > - *pt_prev = NULL; > - } > - > - qdisc_skb_cb(skb)->pkt_len = skb->len; > - tc_skb_cb(skb)->mru = 0; > - tc_skb_cb(skb)->post_ct = false; > - skb->tc_at_ingress = 1; > - mini_qdisc_bstats_cpu_update(miniq, skb); > - > - switch (tcf_classify(skb, miniq->block, miniq->filter_list, &cl_res, false)) { > - case TC_ACT_OK: > - case TC_ACT_RECLASSIFY: > - skb->tc_index = TC_H_MIN(cl_res.classid); > - break; > - case TC_ACT_SHOT: > - mini_qdisc_qstats_cpu_drop(miniq); > - kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS); > - *ret = NET_RX_DROP; > - return NULL; > - case TC_ACT_STOLEN: > - case TC_ACT_QUEUED: > - case TC_ACT_TRAP: > - consume_skb(skb); > - *ret = NET_RX_SUCCESS; > - return NULL; > - case TC_ACT_REDIRECT: > - /* skb_mac_header check was done by cls/act_bpf, so > - * we can safely push the L2 header back before > - * redirecting to another netdev > - */ > - __skb_push(skb, skb->mac_len); > - if (skb_do_redirect(skb) == -EAGAIN) { > - __skb_pull(skb, skb->mac_len); > - *another = true; > - break; > - } > - *ret = NET_RX_SUCCESS; > - return NULL; > - case TC_ACT_CONSUMED: > - *ret = NET_RX_SUCCESS; > - return NULL; > - default: > - break; > - } > -#endif /* CONFIG_NET_CLS_ACT */ > - return skb; > -} > - > /** > * netdev_is_rx_handler_busy - check if receive handler is registered > * @dev: device to check > @@ -10873,7 +10936,7 @@ void unregister_netdevice_many_notify(struct list_head *head, > > /* Shutdown queueing discipline. */ > dev_shutdown(dev); > - > + dev_tcx_uninstall(dev); > dev_xdp_uninstall(dev); > bpf_dev_bound_netdev_unregister(dev); > > diff --git a/net/core/filter.c b/net/core/filter.c > index d25d52854c21..1ff9a0988ea6 100644 > --- a/net/core/filter.c > +++ b/net/core/filter.c > @@ -9233,7 +9233,7 @@ static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_prog *prog, > __u8 value_reg = si->dst_reg; > __u8 skb_reg = si->src_reg; > > -#ifdef CONFIG_NET_CLS_ACT > +#ifdef CONFIG_NET_XGRESS > /* If the tstamp_type is read, > * the bpf prog is aware the tstamp could have delivery time. > * Thus, read skb->tstamp as is if tstamp_type_access is true. > @@ -9267,7 +9267,7 @@ static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_prog *prog, > __u8 value_reg = si->src_reg; > __u8 skb_reg = si->dst_reg; > > -#ifdef CONFIG_NET_CLS_ACT > +#ifdef CONFIG_NET_XGRESS > /* If the tstamp_type is read, > * the bpf prog is aware the tstamp could have delivery time. > * Thus, write skb->tstamp as is if tstamp_type_access is true. > diff --git a/net/sched/Kconfig b/net/sched/Kconfig > index 4b95cb1ac435..470c70deffe2 100644 > --- a/net/sched/Kconfig > +++ b/net/sched/Kconfig > @@ -347,8 +347,7 @@ config NET_SCH_FQ_PIE > config NET_SCH_INGRESS > tristate "Ingress/classifier-action Qdisc" > depends on NET_CLS_ACT > - select NET_INGRESS > - select NET_EGRESS > + select NET_XGRESS > help > Say Y here if you want to use classifiers for incoming and/or outgoing > packets. 
This qdisc doesn't do anything else besides running classifiers, > @@ -679,6 +678,7 @@ config NET_EMATCH_IPT > config NET_CLS_ACT > bool "Actions" > select NET_CLS > + select NET_XGRESS > help > Say Y here if you want to use traffic control actions. Actions > get attached to classifiers and are invoked after a successful > diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c > index 84838128b9c5..4af1360f537e 100644 > --- a/net/sched/sch_ingress.c > +++ b/net/sched/sch_ingress.c > @@ -13,6 +13,7 @@ > #include <net/netlink.h> > #include <net/pkt_sched.h> > #include <net/pkt_cls.h> > +#include <net/tcx.h> > > struct ingress_sched_data { > struct tcf_block *block; > @@ -78,11 +79,18 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt, > { > struct ingress_sched_data *q = qdisc_priv(sch); > struct net_device *dev = qdisc_dev(sch); > + struct bpf_mprog_entry *entry; > + bool created; > int err; > > net_inc_ingress_queue(); > > - mini_qdisc_pair_init(&q->miniqp, sch, &dev->miniq_ingress); > + entry = dev_tcx_entry_fetch_or_create(dev, true, &created); > + if (!entry) > + return -ENOMEM; > + mini_qdisc_pair_init(&q->miniqp, sch, &tcx_entry(entry)->miniq); > + if (created) > + tcx_entry_update(dev, entry, true); > > q->block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS; > q->block_info.chain_head_change = clsact_chain_head_change; > @@ -93,15 +101,20 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt, > return err; > > mini_qdisc_pair_block_init(&q->miniqp, q->block); > - > return 0; > } > > static void ingress_destroy(struct Qdisc *sch) > { > struct ingress_sched_data *q = qdisc_priv(sch); > + struct net_device *dev = qdisc_dev(sch); > + struct bpf_mprog_entry *entry = rtnl_dereference(dev->tcx_ingress); > > tcf_block_put_ext(q->block, sch, &q->block_info); > + if (entry && !bpf_mprog_total(entry)) { > + tcx_entry_update(dev, NULL, true); > + bpf_mprog_free(entry); > + } > net_dec_ingress_queue(); > } > > @@ -217,12 +230,19 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt, > { > struct clsact_sched_data *q = qdisc_priv(sch); > struct net_device *dev = qdisc_dev(sch); > + struct bpf_mprog_entry *entry; > + bool created; > int err; > > net_inc_ingress_queue(); > net_inc_egress_queue(); > > - mini_qdisc_pair_init(&q->miniqp_ingress, sch, &dev->miniq_ingress); > + entry = dev_tcx_entry_fetch_or_create(dev, true, &created); > + if (!entry) > + return -ENOMEM; > + mini_qdisc_pair_init(&q->miniqp_ingress, sch, &tcx_entry(entry)->miniq); > + if (created) > + tcx_entry_update(dev, entry, true); > > q->ingress_block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS; > q->ingress_block_info.chain_head_change = clsact_chain_head_change; > @@ -235,7 +255,12 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt, > > mini_qdisc_pair_block_init(&q->miniqp_ingress, q->ingress_block); > > - mini_qdisc_pair_init(&q->miniqp_egress, sch, &dev->miniq_egress); > + entry = dev_tcx_entry_fetch_or_create(dev, false, &created); > + if (!entry) > + return -ENOMEM; > + mini_qdisc_pair_init(&q->miniqp_egress, sch, &tcx_entry(entry)->miniq); > + if (created) > + tcx_entry_update(dev, entry, false); > > q->egress_block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_EGRESS; > q->egress_block_info.chain_head_change = clsact_chain_head_change; > @@ -247,9 +272,21 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt, > static void clsact_destroy(struct Qdisc *sch) > { > struct clsact_sched_data *q = qdisc_priv(sch); > + struct net_device 
*dev = qdisc_dev(sch); > + struct bpf_mprog_entry *ingress_entry = rtnl_dereference(dev->tcx_ingress); > + struct bpf_mprog_entry *egress_entry = rtnl_dereference(dev->tcx_egress); > > tcf_block_put_ext(q->egress_block, sch, &q->egress_block_info); > + if (egress_entry && !bpf_mprog_total(egress_entry)) { > + tcx_entry_update(dev, NULL, false); > + bpf_mprog_free(egress_entry); > + } > + > tcf_block_put_ext(q->ingress_block, sch, &q->ingress_block_info); > + if (ingress_entry && !bpf_mprog_total(ingress_entry)) { > + tcx_entry_update(dev, NULL, true); > + bpf_mprog_free(ingress_entry); > + } > > net_dec_ingress_queue(); > net_dec_egress_queue(); > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h > index 207f8a37b327..e7584e24bc83 100644 > --- a/tools/include/uapi/linux/bpf.h > +++ b/tools/include/uapi/linux/bpf.h > @@ -1035,6 +1035,8 @@ enum bpf_attach_type { > BPF_TRACE_KPROBE_MULTI, > BPF_LSM_CGROUP, > BPF_STRUCT_OPS, > + BPF_TCX_INGRESS, > + BPF_TCX_EGRESS, > __MAX_BPF_ATTACH_TYPE > }; > > @@ -1052,7 +1054,7 @@ enum bpf_link_type { > BPF_LINK_TYPE_KPROBE_MULTI = 8, > BPF_LINK_TYPE_STRUCT_OPS = 9, > BPF_LINK_TYPE_NETFILTER = 10, > - > + BPF_LINK_TYPE_TCX = 11, > MAX_BPF_LINK_TYPE, > }; > > @@ -1559,13 +1561,13 @@ union bpf_attr { > __u32 map_fd; /* struct_ops to attach */ > }; > union { > - __u32 target_fd; /* object to attach to */ > - __u32 target_ifindex; /* target ifindex */ > + __u32 target_fd; /* target object to attach to or ... */ > + __u32 target_ifindex; /* target ifindex */ > }; > __u32 attach_type; /* attach type */ > __u32 flags; /* extra flags */ > union { > - __u32 target_btf_id; /* btf_id of target to attach to */ > + __u32 target_btf_id; /* btf_id of target to attach to */ > struct { > __aligned_u64 iter_info; /* extra bpf_iter_link_info */ > __u32 iter_info_len; /* iter_info length */ > @@ -1599,6 +1601,13 @@ union bpf_attr { > __s32 priority; > __u32 flags; > } netfilter; > + struct { > + union { > + __u32 relative_fd; > + __u32 relative_id; > + }; > + __u32 expected_revision; > + } tcx; > }; > } link_create; > > @@ -6207,6 +6216,19 @@ struct bpf_sock_tuple { > }; > }; > > +/* (Simplified) user return codes for tcx prog type. > + * A valid tcx program must return one of these defined values. All other > + * return codes are reserved for future use. Must remain compatible with > + * their TC_ACT_* counter-parts. For compatibility in behavior, unknown > + * return codes are mapped to TCX_NEXT. > + */ > +enum tcx_action_base { > + TCX_NEXT = -1, > + TCX_PASS = 0, > + TCX_DROP = 2, > + TCX_REDIRECT = 7, > +}; > + > struct bpf_xdp_sock { > __u32 queue_id; > }; > @@ -6459,6 +6481,11 @@ struct bpf_link_info { > __s32 priority; > __u32 flags; > } netfilter; > + struct { > + __u32 ifindex; > + __u32 attach_type; > + __u32 flags; > + } tcx; > }; > } __attribute__((aligned(8))); > > -- > 2.34.1 >
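
One more question for my own understanding of the uapi side: until there is a libbpf wrapper, the minimal link-based attach from a loader boils down to roughly the below (sketch, error handling omitted; IIUC expected_revision == 0 means no expectation on the current prog array revision)?

  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  static int tcx_link_create(int prog_fd, int ifindex)
  {
  	union bpf_attr attr;

  	memset(&attr, 0, sizeof(attr));
  	attr.link_create.prog_fd = prog_fd;
  	attr.link_create.target_ifindex = ifindex;
  	attr.link_create.attach_type = BPF_TCX_INGRESS;
  	attr.link_create.tcx.expected_revision = 0;

  	return syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));
  }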