Re: [PATCH bpf-next v3 1/7] netkit, bpf: Add bpf programmable net device

Stanislav Fomichev <sdf@xxxxxxxxxx> · Tue, 24 Oct 2023 09:40:11 -0700

On 10/23, Daniel Borkmann wrote:
> This work adds a new, minimal BPF-programmable device called "netkit"
> (former PoC code-name "meta") we recently presented at LSF/MM/BPF. The
> core idea is that BPF programs are executed within the driver's xmit routine,
> which, e.g. in the case of containers/Pods, moves BPF processing closer to
> the source.
> 
> One of the goals was that, in the case of Pod egress traffic, BPF programs
> can be moved from hostns tcx ingress into the device itself, providing
> earlier drop or forward mechanisms. For example, if the BPF program
> determines that the skb must be sent out of the node, then a redirect to
> the physical device can take place directly without going through the
> per-CPU backlog queue. This helps to shift processing for such traffic from
> softirq to process context, leading to better scheduling
> decisions/performance (see measurements in the slides).
> 
> In this initial version, the netkit device ships as a pair, but we plan to
> extend this further so it can also operate in single device mode. The pair
> comes with a primary and a peer device. Only the primary device, typically
> residing in hostns, can manage BPF programs for itself and its peer. The
> peer device is designated for containers/Pods and cannot attach/detach
> BPF programs. Upon the device creation, the user can set the default policy
> to 'forward' or 'drop' for the case when no BPF program is attached.
> 
> Additionally, the device can be operated in L3 (default) or L2 mode. The
> management of BPF programs is done via bpf_mprog, so that multi-attach is
> supported right from the beginning with similar API and dependency controls
> as tcx. For details on the latter see commit 053c8e1f235d ("bpf: Add generic
> attach/detach/query API for multi-progs"). tc BPF compatibility is provided,
> so that existing programs can be easily migrated.
> 
> Going forward, we plan to use netkit devices in Cilium as the main device
> type for connecting Pods. They will be operated in L3 mode in order to
> simplify a Pod's neighbor management, and the peer will operate in default
> drop mode, so that no traffic leaves between the time a Pod is brought up
> by the CNI plugin and the time programs are attached by the agent.
> Additionally, the programs we attach via tcx on the physical devices are
> using bpf_redirect_peer() for inbound traffic into netkit device, hence the
> latter is also supporting the ndo_get_peer_dev callback. Similarly, we use
> bpf_redirect_neigh() for the way out, pushing from netkit peer to phys device
> directly. Also, BIG TCP is supported on netkit device. For the follow-up
> work in single device mode, we plan to convert Cilium's cilium_host/_net
> devices into a single one.
> 
> An extensive test suite for checking device operations and the BPF program
> and link management API comes as BPF selftests in this series.
> 
> Co-developed-by: Nikolay Aleksandrov <razor@xxxxxxxxxxxxx>
> Signed-off-by: Nikolay Aleksandrov <razor@xxxxxxxxxxxxx>
> Signed-off-by: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
> Link: https://github.com/borkmann/iproute2/tree/pr/netkit
> Link: http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf (24ff.)
> ---
>  MAINTAINERS                    |   9 +
>  drivers/net/Kconfig            |   9 +
>  drivers/net/Makefile           |   1 +
>  drivers/net/netkit.c           | 934 +++++++++++++++++++++++++++++++++
>  include/net/netkit.h           |  38 ++
>  include/uapi/linux/bpf.h       |  14 +
>  include/uapi/linux/if_link.h   |  24 +
>  kernel/bpf/syscall.c           |  30 +-
>  tools/include/uapi/linux/bpf.h |  14 +
>  9 files changed, 1068 insertions(+), 5 deletions(-)
>  create mode 100644 drivers/net/netkit.c
>  create mode 100644 include/net/netkit.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index ed33b9df8b3d..43be6197e655 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3795,6 +3795,15 @@ L:	bpf@xxxxxxxxxxxxxxx
>  S:	Odd Fixes
>  K:	(?:\b|_)bpf(?:\b|_)
>  
> +BPF [NETKIT] (BPF-programmable network device)
> +M:	Daniel Borkmann <daniel@xxxxxxxxxxxxx>
> +M:	Nikolay Aleksandrov <razor@xxxxxxxxxxxxx>
> +L:	bpf@xxxxxxxxxxxxxxx
> +L:	netdev@xxxxxxxxxxxxxxx
> +S:	Supported
> +F:	drivers/net/netkit.c
> +F:	include/net/netkit.h
> +
>  BPF [NETWORKING] (struct_ops, reuseport)
>  M:	Martin KaFai Lau <martin.lau@xxxxxxxxx>
>  L:	bpf@xxxxxxxxxxxxxxx
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 44eeb5d61ba9..af0da4bb429b 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -448,6 +448,15 @@ config NLMON
>  	  diagnostics, etc. This is mostly intended for developers or support
>  	  to debug netlink issues. If unsure, say N.
>  
> +config NETKIT
> +	bool "BPF-programmable network device"
> +	depends on BPF_SYSCALL
> +	help
> +	  The netkit device is a virtual networking device where BPF programs
> +	  can be attached to the device(s) transmission routine in order to
> +	  implement the driver's internal logic. The device can be configured
> +	  to operate in L3 or L2 mode. If unsure, say N.
> +
>  config NET_VRF
>  	tristate "Virtual Routing and Forwarding (Lite)"
>  	depends on IP_MULTIPLE_TABLES
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index 8a83db32509d..7cab36f94782 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -22,6 +22,7 @@ obj-$(CONFIG_MDIO) += mdio.o
>  obj-$(CONFIG_NET) += loopback.o
>  obj-$(CONFIG_NETDEV_LEGACY_INIT) += Space.o
>  obj-$(CONFIG_NETCONSOLE) += netconsole.o
> +obj-$(CONFIG_NETKIT) += netkit.o
>  obj-y += phy/
>  obj-y += pse-pd/
>  obj-y += mdio/
> diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
> new file mode 100644
> index 000000000000..faf756702aa1
> --- /dev/null
> +++ b/drivers/net/netkit.c
> @@ -0,0 +1,934 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2023 Isovalent */
> +
> +#include <linux/netdevice.h>
> +#include <linux/ethtool.h>
> +#include <linux/etherdevice.h>
> +#include <linux/filter.h>
> +#include <linux/netfilter_netdev.h>
> +#include <linux/bpf_mprog.h>
> +
> +#include <net/netkit.h>
> +#include <net/dst.h>
> +#include <net/tcx.h>
> +
> +#define DRV_NAME "netkit"
> +
> +struct netkit {
> +	/* Needed in fast-path */
> +	struct net_device __rcu *peer;
> +	struct bpf_mprog_entry __rcu *active;
> +	enum netkit_action policy;
> +	struct bpf_mprog_bundle	bundle;
> +
> +	/* Needed in slow-path */
> +	enum netkit_mode mode;
> +	bool primary;
> +	u32 headroom;
> +};
> +
> +struct netkit_link {
> +	struct bpf_link link;
> +	struct net_device *dev;
> +	u32 location;
> +};
> +
> +static __always_inline int
> +netkit_run(const struct bpf_mprog_entry *entry, struct sk_buff *skb,
> +	   enum netkit_action ret)
> +{
> +	const struct bpf_mprog_fp *fp;
> +	const struct bpf_prog *prog;
> +
> +	bpf_mprog_foreach_prog(entry, fp, prog) {
> +		bpf_compute_data_pointers(skb);
> +		ret = bpf_prog_run(prog, skb);
> +		if (ret != NETKIT_NEXT)
> +			break;
> +	}
> +	return ret;
> +}
> +
> +static void netkit_prep_forward(struct sk_buff *skb, bool xnet)
> +{
> +	skb_scrub_packet(skb, xnet);
> +	skb->priority = 0;
> +	nf_skip_egress(skb, true);
> +}
> +
> +static struct netkit *netkit_priv(const struct net_device *dev)
> +{
> +	return netdev_priv(dev);
> +}
> +
> +static netdev_tx_t netkit_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> +	struct netkit *nk = netkit_priv(dev);
> +	enum netkit_action ret = READ_ONCE(nk->policy);
> +	netdev_tx_t ret_dev = NET_XMIT_SUCCESS;
> +	const struct bpf_mprog_entry *entry;
> +	struct net_device *peer;
> +
> +	rcu_read_lock();
> +	peer = rcu_dereference(nk->peer);
> +	if (unlikely(!peer || !(peer->flags & IFF_UP) ||
> +		     !pskb_may_pull(skb, ETH_HLEN) ||
> +		     skb_orphan_frags(skb, GFP_ATOMIC)))
> +		goto drop;
> +	netkit_prep_forward(skb, !net_eq(dev_net(dev), dev_net(peer)));
> +	skb->dev = peer;
> +	entry = rcu_dereference(nk->active);
> +	if (entry)
> +		ret = netkit_run(entry, skb, ret);
> +	switch (ret) {
> +	case NETKIT_NEXT:
> +	case NETKIT_PASS:
> +		skb->protocol = eth_type_trans(skb, skb->dev);
> +		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
> +		__netif_rx(skb);
> +		break;
> +	case NETKIT_REDIRECT:
> +		skb_do_redirect(skb);
> +		break;
> +	case NETKIT_DROP:
> +	default:
> +drop:
> +		kfree_skb(skb);
> +		dev_core_stats_tx_dropped_inc(dev);
> +		ret_dev = NET_XMIT_DROP;
> +		break;
> +	}
> +	rcu_read_unlock();
> +	return ret_dev;
> +}
> +
> +static int netkit_open(struct net_device *dev)
> +{
> +	struct netkit *nk = netkit_priv(dev);
> +	struct net_device *peer = rtnl_dereference(nk->peer);
> +
> +	if (!peer)
> +		return -ENOTCONN;
> +	if (peer->flags & IFF_UP) {
> +		netif_carrier_on(dev);
> +		netif_carrier_on(peer);
> +	}
> +	return 0;
> +}
> +
> +static int netkit_close(struct net_device *dev)
> +{
> +	struct netkit *nk = netkit_priv(dev);
> +	struct net_device *peer = rtnl_dereference(nk->peer);
> +
> +	netif_carrier_off(dev);
> +	if (peer)
> +		netif_carrier_off(peer);
> +	return 0;
> +}
> +
> +static int netkit_get_iflink(const struct net_device *dev)
> +{
> +	struct netkit *nk = netkit_priv(dev);
> +	struct net_device *peer;
> +	int iflink = 0;
> +
> +	rcu_read_lock();
> +	peer = rcu_dereference(nk->peer);
> +	if (peer)
> +		iflink = peer->ifindex;
> +	rcu_read_unlock();
> +	return iflink;
> +}
> +
> +static void netkit_set_multicast(struct net_device *dev)
> +{
> +	/* Nothing to do, we receive whatever gets pushed to us! */
> +}
> +
> +static void netkit_set_headroom(struct net_device *dev, int headroom)
> +{
> +	struct netkit *nk = netkit_priv(dev), *nk2;
> +	struct net_device *peer;
> +
> +	if (headroom < 0)
> +		headroom = NET_SKB_PAD;
> +
> +	rcu_read_lock();
> +	peer = rcu_dereference(nk->peer);
> +	if (unlikely(!peer))
> +		goto out;
> +
> +	nk2 = netkit_priv(peer);
> +	nk->headroom = headroom;
> +	headroom = max(nk->headroom, nk2->headroom);
> +
> +	peer->needed_headroom = headroom;
> +	dev->needed_headroom = headroom;
> +out:
> +	rcu_read_unlock();
> +}
> +
> +static struct net_device *netkit_peer_dev(struct net_device *dev)
> +{
> +	return rcu_dereference(netkit_priv(dev)->peer);
> +}
> +
> +static const struct net_device_ops netkit_netdev_ops = {
> +	.ndo_open		= netkit_open,
> +	.ndo_stop		= netkit_close,
> +	.ndo_start_xmit		= netkit_xmit,
> +	.ndo_set_rx_mode	= netkit_set_multicast,
> +	.ndo_set_rx_headroom	= netkit_set_headroom,
> +	.ndo_get_iflink		= netkit_get_iflink,
> +	.ndo_get_peer_dev	= netkit_peer_dev,
> +	.ndo_features_check	= passthru_features_check,
> +};
> +
> +static void netkit_get_drvinfo(struct net_device *dev,
> +			       struct ethtool_drvinfo *info)
> +{
> +	strscpy(info->driver, DRV_NAME, sizeof(info->driver));
> +}
> +
> +static const struct ethtool_ops netkit_ethtool_ops = {
> +	.get_drvinfo		= netkit_get_drvinfo,
> +};
> +
> +static void netkit_setup(struct net_device *dev)
> +{
> +	static const netdev_features_t netkit_features_hw_vlan =
> +		NETIF_F_HW_VLAN_CTAG_TX |
> +		NETIF_F_HW_VLAN_CTAG_RX |
> +		NETIF_F_HW_VLAN_STAG_TX |
> +		NETIF_F_HW_VLAN_STAG_RX;
> +	static const netdev_features_t netkit_features =
> +		netkit_features_hw_vlan |
> +		NETIF_F_SG |
> +		NETIF_F_FRAGLIST |
> +		NETIF_F_HW_CSUM |
> +		NETIF_F_RXCSUM |
> +		NETIF_F_SCTP_CRC |
> +		NETIF_F_HIGHDMA |
> +		NETIF_F_GSO_SOFTWARE |
> +		NETIF_F_GSO_ENCAP_ALL;
> +
> +	ether_setup(dev);
> +	dev->max_mtu = ETH_MAX_MTU;
> +
> +	dev->flags |= IFF_NOARP;
> +	dev->priv_flags &= ~IFF_TX_SKB_SHARING;
> +	dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
> +	dev->priv_flags |= IFF_PHONY_HEADROOM;
> +	dev->priv_flags |= IFF_NO_QUEUE;
> +
> +	dev->ethtool_ops = &netkit_ethtool_ops;
> +	dev->netdev_ops  = &netkit_netdev_ops;
> +
> +	dev->features |= netkit_features | NETIF_F_LLTX;
> +	dev->hw_features = netkit_features;
> +	dev->hw_enc_features = netkit_features;
> +	dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
> +	dev->vlan_features = dev->features & ~netkit_features_hw_vlan;
> +
> +	dev->needs_free_netdev = true;
> +
> +	netif_set_tso_max_size(dev, GSO_MAX_SIZE);
> +}
> +
> +static struct net *netkit_get_link_net(const struct net_device *dev)
> +{
> +	struct netkit *nk = netkit_priv(dev);
> +	struct net_device *peer = rtnl_dereference(nk->peer);
> +
> +	return peer ? dev_net(peer) : dev_net(dev);
> +}
> +
> +static int netkit_check_policy(int policy, struct nlattr *tb,
> +			       struct netlink_ext_ack *extack)
> +{
> +	switch (policy) {
> +	case NETKIT_PASS:
> +	case NETKIT_DROP:
> +		return 0;
> +	default:
> +		NL_SET_ERR_MSG_ATTR(extack, tb,
> +				    "Provided default xmit policy not supported");
> +		return -EINVAL;
> +	}
> +}
> +
> +static int netkit_check_mode(int mode, struct nlattr *tb,
> +			     struct netlink_ext_ack *extack)
> +{
> +	switch (mode) {
> +	case NETKIT_L2:
> +	case NETKIT_L3:
> +		return 0;
> +	default:
> +		NL_SET_ERR_MSG_ATTR(extack, tb,
> +				    "Provided device mode can only be L2 or L3");
> +		return -EINVAL;
> +	}
> +}
> +
> +static int netkit_validate(struct nlattr *tb[], struct nlattr *data[],
> +			   struct netlink_ext_ack *extack)
> +{
> +	struct nlattr *attr = tb[IFLA_ADDRESS];
> +
> +	if (!attr)
> +		return 0;
> +	NL_SET_ERR_MSG_ATTR(extack, attr,
> +			    "Setting Ethernet address is not supported");
> +	return -EOPNOTSUPP;
> +}
> +
> +static struct rtnl_link_ops netkit_link_ops;
> +
> +static int netkit_new_link(struct net *src_net, struct net_device *dev,
> +			   struct nlattr *tb[], struct nlattr *data[],
> +			   struct netlink_ext_ack *extack)
> +{
> +	struct nlattr *peer_tb[IFLA_MAX + 1], **tbp = tb, *attr;
> +	enum netkit_action default_prim = NETKIT_PASS;
> +	enum netkit_action default_peer = NETKIT_PASS;
> +	enum netkit_mode mode = NETKIT_L3;
> +	unsigned char ifname_assign_type;
> +	struct ifinfomsg *ifmp = NULL;
> +	struct net_device *peer;
> +	char ifname[IFNAMSIZ];
> +	struct netkit *nk;
> +	struct net *net;
> +	int err;
> +
> +	if (data) {
> +		if (data[IFLA_NETKIT_MODE]) {
> +			attr = data[IFLA_NETKIT_MODE];
> +			mode = nla_get_u32(attr);
> +			err = netkit_check_mode(mode, attr, extack);
> +			if (err < 0)
> +				return err;
> +		}
> +		if (data[IFLA_NETKIT_PEER_INFO]) {
> +			attr = data[IFLA_NETKIT_PEER_INFO];
> +			ifmp = nla_data(attr);
> +			err = rtnl_nla_parse_ifinfomsg(peer_tb, attr, extack);
> +			if (err < 0)
> +				return err;
> +			err = netkit_validate(peer_tb, NULL, extack);
> +			if (err < 0)
> +				return err;
> +			tbp = peer_tb;
> +		}
> +		if (data[IFLA_NETKIT_POLICY]) {
> +			attr = data[IFLA_NETKIT_POLICY];
> +			default_prim = nla_get_u32(attr);
> +			err = netkit_check_policy(default_prim, attr, extack);
> +			if (err < 0)
> +				return err;
> +		}
> +		if (data[IFLA_NETKIT_PEER_POLICY]) {
> +			attr = data[IFLA_NETKIT_PEER_POLICY];
> +			default_peer = nla_get_u32(attr);
> +			err = netkit_check_policy(default_peer, attr, extack);
> +			if (err < 0)
> +				return err;
> +		}
> +	}
> +
> +	if (ifmp && tbp[IFLA_IFNAME]) {
> +		nla_strscpy(ifname, tbp[IFLA_IFNAME], IFNAMSIZ);
> +		ifname_assign_type = NET_NAME_USER;
> +	} else {
> +		strscpy(ifname, "nk%d", IFNAMSIZ);
> +		ifname_assign_type = NET_NAME_ENUM;
> +	}
> +
> +	net = rtnl_link_get_net(src_net, tbp);
> +	if (IS_ERR(net))
> +		return PTR_ERR(net);
> +
> +	peer = rtnl_create_link(net, ifname, ifname_assign_type,
> +				&netkit_link_ops, tbp, extack);
> +	if (IS_ERR(peer)) {
> +		put_net(net);
> +		return PTR_ERR(peer);
> +	}
> +
> +	netif_inherit_tso_max(peer, dev);
> +
> +	if (mode == NETKIT_L2)
> +		eth_hw_addr_random(peer);
> +	if (ifmp && dev->ifindex)
> +		peer->ifindex = ifmp->ifi_index;
> +
> +	nk = netkit_priv(peer);
> +	nk->primary = false;
> +	nk->policy = default_peer;
> +	nk->mode = mode;
> +	bpf_mprog_bundle_init(&nk->bundle);
> +	RCU_INIT_POINTER(nk->active, NULL);
> +	RCU_INIT_POINTER(nk->peer, NULL);
> +
> +	err = register_netdevice(peer);
> +	put_net(net);
> +	if (err < 0)
> +		goto err_register_peer;
> +	netif_carrier_off(peer);
> +	if (mode == NETKIT_L2)
> +		dev_change_flags(peer, peer->flags & ~IFF_NOARP, NULL);
> +
> +	err = rtnl_configure_link(peer, NULL, 0, NULL);
> +	if (err < 0)
> +		goto err_configure_peer;
> +
> +	if (mode == NETKIT_L2)
> +		eth_hw_addr_random(dev);
> +	if (tb[IFLA_IFNAME])
> +		nla_strscpy(dev->name, tb[IFLA_IFNAME], IFNAMSIZ);
> +	else
> +		strscpy(dev->name, "nk%d", IFNAMSIZ);
> +
> +	nk = netkit_priv(dev);
> +	nk->primary = true;
> +	nk->policy = default_prim;
> +	nk->mode = mode;
> +	bpf_mprog_bundle_init(&nk->bundle);
> +	RCU_INIT_POINTER(nk->active, NULL);
> +	RCU_INIT_POINTER(nk->peer, NULL);
> +
> +	err = register_netdevice(dev);
> +	if (err < 0)
> +		goto err_configure_peer;
> +	netif_carrier_off(dev);
> +	if (mode == NETKIT_L2)
> +		dev_change_flags(dev, dev->flags & ~IFF_NOARP, NULL);
> +
> +	rcu_assign_pointer(netkit_priv(dev)->peer, peer);
> +	rcu_assign_pointer(netkit_priv(peer)->peer, dev);
> +	return 0;
> +err_configure_peer:
> +	unregister_netdevice(peer);
> +	return err;
> +err_register_peer:
> +	free_netdev(peer);
> +	return err;
> +}
> +
> +static struct bpf_mprog_entry *netkit_entry_fetch(struct net_device *dev,
> +						  bool bundle_fallback)
> +{
> +	struct netkit *nk = netkit_priv(dev);
> +	struct bpf_mprog_entry *entry;
> +
> +	ASSERT_RTNL();
> +	entry = rcu_dereference_rtnl(nk->active);
> +	if (entry)
> +		return entry;
> +	if (bundle_fallback)
> +		return &nk->bundle.a;
> +	return NULL;
> +}
> +
> +static void netkit_entry_update(struct net_device *dev,
> +				struct bpf_mprog_entry *entry)
> +{
> +	struct netkit *nk = netkit_priv(dev);
> +
> +	ASSERT_RTNL();
> +	rcu_assign_pointer(nk->active, entry);
> +}
> +
> +static void netkit_entry_sync(void)
> +{
> +	synchronize_rcu();
> +}
> +
> +static struct net_device *netkit_dev_fetch(struct net *net, u32 ifindex, u32 which)
> +{
> +	struct net_device *dev;
> +	struct netkit *nk;
> +
> +	ASSERT_RTNL();
> +
> +	switch (which) {
> +	case BPF_NETKIT_PRIMARY:
> +	case BPF_NETKIT_PEER:
> +		break;
> +	default:
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	dev = __dev_get_by_index(net, ifindex);
> +	if (!dev)
> +		return ERR_PTR(-ENODEV);
> +	if (dev->netdev_ops != &netkit_netdev_ops)
> +		return ERR_PTR(-ENXIO);
> +
> +	nk = netkit_priv(dev);
> +	if (!nk->primary)
> +		return ERR_PTR(-EACCES);
> +	if (which == BPF_NETKIT_PEER) {
> +		dev = rcu_dereference_rtnl(nk->peer);
> +		if (!dev)
> +			return ERR_PTR(-ENODEV);
> +	}
> +	return dev;
> +}
> +
> +int netkit_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> +	struct bpf_mprog_entry *entry, *entry_new;
> +	struct bpf_prog *replace_prog = NULL;
> +	struct net_device *dev;
> +	int ret;
> +
> +	rtnl_lock();
> +	dev = netkit_dev_fetch(current->nsproxy->net_ns, attr->target_ifindex,
> +			       attr->attach_type);
> +	if (IS_ERR(dev)) {
> +		ret = PTR_ERR(dev);
> +		goto out;
> +	}
> +	entry = netkit_entry_fetch(dev, true);
> +	if (attr->attach_flags & BPF_F_REPLACE) {
> +		replace_prog = bpf_prog_get_type(attr->replace_bpf_fd,
> +						 prog->type);
> +		if (IS_ERR(replace_prog)) {
> +			ret = PTR_ERR(replace_prog);
> +			replace_prog = NULL;
> +			goto out;
> +		}
> +	}
> +	ret = bpf_mprog_attach(entry, &entry_new, prog, NULL, replace_prog,
> +			       attr->attach_flags, attr->relative_fd,
> +			       attr->expected_revision);
> +	if (!ret) {
> +		if (entry != entry_new) {
> +			netkit_entry_update(dev, entry_new);
> +			netkit_entry_sync();
> +		}
> +		bpf_mprog_commit(entry);
> +	}
> +out:
> +	if (replace_prog)
> +		bpf_prog_put(replace_prog);
> +	rtnl_unlock();
> +	return ret;
> +}
> +
> +int netkit_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> +	struct bpf_mprog_entry *entry, *entry_new;
> +	struct net_device *dev;
> +	int ret;
> +
> +	rtnl_lock();
> +	dev = netkit_dev_fetch(current->nsproxy->net_ns, attr->target_ifindex,
> +			       attr->attach_type);
> +	if (IS_ERR(dev)) {
> +		ret = PTR_ERR(dev);
> +		goto out;
> +	}
> +	entry = netkit_entry_fetch(dev, false);
> +	if (!entry) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +	ret = bpf_mprog_detach(entry, &entry_new, prog, NULL, attr->attach_flags,
> +			       attr->relative_fd, attr->expected_revision);
> +	if (!ret) {
> +		if (!bpf_mprog_total(entry_new))
> +			entry_new = NULL;
> +		netkit_entry_update(dev, entry_new);
> +		netkit_entry_sync();
> +		bpf_mprog_commit(entry);
> +	}
> +out:
> +	rtnl_unlock();
> +	return ret;
> +}
> +
> +int netkit_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr)
> +{
> +	struct net_device *dev;
> +	int ret;
> +
> +	rtnl_lock();
> +	dev = netkit_dev_fetch(current->nsproxy->net_ns,
> +			       attr->query.target_ifindex,
> +			       attr->query.attach_type);
> +	if (IS_ERR(dev)) {
> +		ret = PTR_ERR(dev);
> +		goto out;
> +	}
> +	ret = bpf_mprog_query(attr, uattr, netkit_entry_fetch(dev, false));
> +out:
> +	rtnl_unlock();
> +	return ret;
> +}
> +
> +static struct netkit_link *netkit_link(const struct bpf_link *link)
> +{
> +	return container_of(link, struct netkit_link, link);
> +}
> +
> +static int netkit_link_prog_attach(struct bpf_link *link, u32 flags,
> +				   u32 id_or_fd, u64 revision)
> +{
> +	struct netkit_link *nkl = netkit_link(link);
> +	struct bpf_mprog_entry *entry, *entry_new;
> +	struct net_device *dev = nkl->dev;
> +	int ret;
> +
> +	ASSERT_RTNL();
> +	entry = netkit_entry_fetch(dev, true);
> +	ret = bpf_mprog_attach(entry, &entry_new, link->prog, link, NULL, flags,
> +			       id_or_fd, revision);
> +	if (!ret) {
> +		if (entry != entry_new) {
> +			netkit_entry_update(dev, entry_new);
> +			netkit_entry_sync();
> +		}
> +		bpf_mprog_commit(entry);
> +	}
> +	return ret;
> +}
> +
> +static void netkit_link_release(struct bpf_link *link)
> +{
> +	struct netkit_link *nkl = netkit_link(link);
> +	struct bpf_mprog_entry *entry, *entry_new;
> +	struct net_device *dev;
> +	int ret = 0;
> +
> +	rtnl_lock();
> +	dev = nkl->dev;
> +	if (!dev)
> +		goto out;
> +	entry = netkit_entry_fetch(dev, false);
> +	if (!entry) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +	ret = bpf_mprog_detach(entry, &entry_new, link->prog, link, 0, 0, 0);
> +	if (!ret) {
> +		if (!bpf_mprog_total(entry_new))
> +			entry_new = NULL;
> +		netkit_entry_update(dev, entry_new);
> +		netkit_entry_sync();
> +		bpf_mprog_commit(entry);
> +		nkl->dev = NULL;
> +	}
> +out:
> +	WARN_ON_ONCE(ret);
> +	rtnl_unlock();
> +}
> +
> +static int netkit_link_update(struct bpf_link *link, struct bpf_prog *nprog,
> +			      struct bpf_prog *oprog)
> +{
> +	struct netkit_link *nkl = netkit_link(link);
> +	struct bpf_mprog_entry *entry, *entry_new;
> +	struct net_device *dev;
> +	int ret = 0;
> +
> +	rtnl_lock();
> +	dev = nkl->dev;
> +	if (!dev) {
> +		ret = -ENOLINK;
> +		goto out;
> +	}
> +	if (oprog && link->prog != oprog) {
> +		ret = -EPERM;
> +		goto out;
> +	}
> +	oprog = link->prog;
> +	if (oprog == nprog) {
> +		bpf_prog_put(nprog);
> +		goto out;
> +	}
> +	entry = netkit_entry_fetch(dev, false);
> +	if (!entry) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +	ret = bpf_mprog_attach(entry, &entry_new, nprog, link, oprog,
> +			       BPF_F_REPLACE | BPF_F_ID,
> +			       link->prog->aux->id, 0);
> +	if (!ret) {
> +		WARN_ON_ONCE(entry != entry_new);
> +		oprog = xchg(&link->prog, nprog);
> +		bpf_prog_put(oprog);
> +		bpf_mprog_commit(entry);
> +	}
> +out:
> +	rtnl_unlock();
> +	return ret;
> +}
> +
> +static void netkit_link_dealloc(struct bpf_link *link)
> +{
> +	kfree(netkit_link(link));
> +}
> +
> +static void netkit_link_fdinfo(const struct bpf_link *link, struct seq_file *seq)
> +{
> +	const struct netkit_link *nkl = netkit_link(link);
> +	u32 ifindex = 0;
> +
> +	rtnl_lock();
> +	if (nkl->dev)
> +		ifindex = nkl->dev->ifindex;
> +	rtnl_unlock();
> +
> +	seq_printf(seq, "ifindex:\t%u\n", ifindex);
> +	seq_printf(seq, "attach_type:\t%u (%s)\n",
> +		   nkl->location,
> +		   nkl->location == BPF_NETKIT_PRIMARY ? "primary" : "peer");
> +}
> +
> +static int netkit_link_fill_info(const struct bpf_link *link,
> +				 struct bpf_link_info *info)
> +{
> +	const struct netkit_link *nkl = netkit_link(link);
> +	u32 ifindex = 0;
> +
> +	rtnl_lock();
> +	if (nkl->dev)
> +		ifindex = nkl->dev->ifindex;
> +	rtnl_unlock();
> +
> +	info->netkit.ifindex = ifindex;
> +	info->netkit.attach_type = nkl->location;
> +	return 0;
> +}
> +
> +static int netkit_link_detach(struct bpf_link *link)
> +{
> +	netkit_link_release(link);
> +	return 0;
> +}
> +
> +static const struct bpf_link_ops netkit_link_lops = {
> +	.release	= netkit_link_release,
> +	.detach		= netkit_link_detach,
> +	.dealloc	= netkit_link_dealloc,
> +	.update_prog	= netkit_link_update,
> +	.show_fdinfo	= netkit_link_fdinfo,
> +	.fill_link_info	= netkit_link_fill_info,
> +};
> +
> +static int netkit_link_init(struct netkit_link *nkl,
> +			    struct bpf_link_primer *link_primer,
> +			    const union bpf_attr *attr,
> +			    struct net_device *dev,
> +			    struct bpf_prog *prog)
> +{
> +	bpf_link_init(&nkl->link, BPF_LINK_TYPE_NETKIT,
> +		      &netkit_link_lops, prog);
> +	nkl->location = attr->link_create.attach_type;
> +	nkl->dev = dev;
> +	return bpf_link_prime(&nkl->link, link_primer);
> +}
> +
> +int netkit_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> +	struct bpf_link_primer link_primer;
> +	struct netkit_link *nkl;
> +	struct net_device *dev;
> +	int ret;
> +
> +	rtnl_lock();
> +	dev = netkit_dev_fetch(current->nsproxy->net_ns,
> +			       attr->link_create.target_ifindex,
> +			       attr->link_create.attach_type);
> +	if (IS_ERR(dev)) {
> +		ret = PTR_ERR(dev);
> +		goto out;
> +	}
> +	nkl = kzalloc(sizeof(*nkl), GFP_KERNEL_ACCOUNT);
> +	if (!nkl) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	ret = netkit_link_init(nkl, &link_primer, attr, dev, prog);
> +	if (ret) {
> +		kfree(nkl);
> +		goto out;
> +	}

The series looks great! FWIW:
Acked-by: Stanislav Fomichev <sdf@xxxxxxxxxx>

One small question:
Since the introduction of tcx, and now with netkit, we seem to store
non-refcounted dev pointers in the bpf_link(s). Is it guaranteed that
the dev will outlive the link?
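
To make the question concrete: the `if (!dev)` checks in
netkit_link_release() and netkit_link_update() above suggest that the
link is not expected to outlive the device blindly, but that something
on the device teardown path clears nkl->dev under rtnl. Below is a
minimal userspace sketch of that pattern. It is only a model under that
assumption: the model_* names are hypothetical, locking is omitted
(in the kernel it would all happen under rtnl), and I have not checked
it against the teardown code in the rest of this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only: stand-ins for net_device / netkit_link. */
struct model_link;

struct model_dev {
	int ifindex;
	struct model_link *links[4];	/* links attached to this device */
	size_t nr_links;
};

struct model_link {
	struct model_dev *dev;		/* borrowed pointer, no refcount held */
};

/* Attach: the link borrows the device pointer without taking a reference. */
static int model_link_attach(struct model_link *link, struct model_dev *dev)
{
	assert(link && dev);
	if (dev->nr_links == 4)
		return -1;
	link->dev = dev;
	dev->links[dev->nr_links++] = link;
	return 0;
}

/* Device teardown: sever every link before the device memory goes away. */
static void model_dev_teardown(struct model_dev *dev)
{
	for (size_t i = 0; i < dev->nr_links; i++)
		dev->links[i]->dev = NULL;
	dev->nr_links = 0;
}

/* Same shape as netkit_link_fill_info(): report 0 once the device is gone. */
static int model_link_ifindex(const struct model_link *link)
{
	return link->dev ? link->dev->ifindex : 0;
}
```

If the device side really does this for every attached link, then the
link never dereferences a stale pointer and a refcount is not needed;
the link simply becomes defunct (-ENOLINK on update, ifindex 0 in
fdinfo).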

> +	ret = netkit_link_prog_attach(&nkl->link,
> +				      attr->link_create.flags,
> +				      attr->link_create.netkit.relative_fd,
> +				      attr->link_create.netkit.expected_revision);
> +	if (ret) {
> +		nkl->dev = NULL;
> +		bpf_link_cleanup(&link_primer);
> +		goto out;

What happens to nkl here? Do we leak it?
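
A possible answer, stated as an assumption rather than something I
re-verified against this series: once bpf_link_prime() has succeeded,
the link is owned by the primed file, and bpf_link_cleanup() clears
link->prog and drops that file reference, so the final put skips
->release and ends in ->dealloc (here netkit_link_dealloc(), which
kfree()s nkl). The userspace sketch below models only that ownership
flow; all model_* names are hypothetical and the file/fd handling is
collapsed into a direct call.

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace model only: stand-ins for bpf_link / bpf_link_ops. */
struct model_link;

struct model_link_ops {
	void (*release)(struct model_link *link);
	void (*dealloc)(struct model_link *link);
};

struct model_link {
	const struct model_link_ops *ops;
	void *prog;	/* non-NULL while a program is attached */
};

static int model_released, model_dealloced;

static void model_release(struct model_link *link)
{
	(void)link;
	model_released++;
}

static void model_dealloc(struct model_link *link)
{
	model_dealloced++;
	free(link);
}

static const struct model_link_ops model_ops = {
	.release = model_release,
	.dealloc = model_dealloc,
};

/* Final put: release only if a prog is still attached, always dealloc. */
static void model_link_free(struct model_link *link)
{
	assert(link->ops && link->ops->dealloc);
	if (link->prog)
		link->ops->release(link);
	link->ops->dealloc(link);
}

/* Error path after a successful prime: clear prog, then drop the last ref. */
static void model_link_cleanup(struct model_link *link)
{
	link->prog = NULL;
	model_link_free(link);
}

/* Walks the failing path: allocate, "prime", attach fails, cleanup. */
static int model_attach_failure(void)
{
	static int dummy_prog;
	struct model_link *link = calloc(1, sizeof(*link));

	if (!link)
		return -1;
	link->ops = &model_ops;
	link->prog = &dummy_prog;
	model_link_cleanup(link);
	return 0;
}
```

Under that reading there is no leak, and an explicit kfree(nkl) in this
branch would actually be a double free. It would also explain why
nkl->dev is reset to NULL before the cleanup call.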

> +	}
> +	ret = bpf_link_settle(&link_primer);
> +out:
> +	rtnl_unlock();
> +	return ret;
> +}