Re: [PATCH bpf-next 1/8] meta, bpf: Add bpf programmable meta device

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 9/28/23 11:16 AM, Toke Høiland-Jørgensen wrote:
Daniel Borkmann <daniel@xxxxxxxxxxxxx> writes:

This work adds a new, minimal BPF-programmable device called "meta" we
recently presented at LSF/MM/BPF. The latter name derives from the Greek
μετά, encompassing a wide array of meanings such as "on top of", "beyond".
Given business logic is defined by BPF, this device can have many meanings.
The core idea is that BPF programs are executed within the drivers xmit
routine and therefore e.g. in case of containers/Pods moving BPF processing
closer to the source.

I like the concept, but I think we should change the name (as I believe
I also mentioned back when you presented it at LSF/MM/BPF). I know this
is basically bikeshedding, but I nevertheless think it is important, for
a couple of reasons:

- As you say, meta has a specific meaning, and this device is not a
   "meta" device in the common sense of the word: it is not tied to other
   devices (so it's not 'on top of' anything), and it is not "about"
   anything (as in metadata). It is just a device type that is programmed
   by BPF, so let's call it that.

- It's not discoverable; how are people supposed to figure out that they
   should go look for a 'meta' device? We also already have multiple
   things called 'metadata', so this is just going to create even more
   confusion (as we also discussed in relation to 'xdp hints').

- It squats on a pretty widely used term throughout the kernel
   (CONFIG_META, 'meta' as the module name). This is related to the above
   point; seeing something named 'meta' in lsmod, the natural assumption
   wouldn't be that it's a network driver.

I think we should just name the driver 'bpfnet'; it's not pretty, but
it's obvious and descriptive. Optionally we could teach 'ip' to
understand just 'bpf' as the device type, so you could go 'ip link add
type bpf' and get one of these.

I'll think about it, the bpfnet sounds terrible as you also noticed. I
definitely don't like that. Perhaps meta_net as suggested by Andrii in
the other thread could be a compromise. Need to sleep over it, my pref
was actually to keep it shorter.

One of the goals was that in case of Pod egress traffic, this allows to
move BPF programs from hostns tcx ingress into the device itself, providing
earlier drop or forward mechanisms, for example, if the BPF program
determines that the skb must be sent out of the node, then a redirect to
the physical device can take place directly without going through per-CPU
backlog queue. This helps to shift processing for such traffic from softirq
to process context, leading to better scheduling decisions and better
performance.

So my only reservation to having this tied to a BPF-only device like
this is basically that if this is indeed such a big win, shouldn't we
try to make the stack operate in this mode by default? I assume you did
the analysis of what it would take to change veth to operate in this
mode; so what was the reason you decided to create a new device type
instead?

There are multiple virtual device flavors and veth is not the sole one. Could
other virtual devices have been extended into veth? Perhaps, but it doesn't
mean it should. veth has very much connotation of L2 and device pair. In this
case here the core of it is around having BPF logic as part of the xmit logic
(with default policies when no BPF is attached), being able to have L3 mode
and having the option to use them as paired devices but also as just single/
standalone one which we plan to push as next step after this series.

Some comments on the code below:

--- /dev/null
+++ b/drivers/net/meta.c
@@ -0,0 +1,734 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2023 Isovalent */
+
+#include <linux/netdevice.h>
+#include <linux/ethtool.h>
+#include <linux/etherdevice.h>
+#include <linux/filter.h>
+#include <linux/netfilter_netdev.h>
+#include <linux/bpf_mprog.h>
+
+#include <net/meta.h>
+#include <net/dst.h>
+#include <net/tcx.h>
+
+#define DRV_NAME	"meta"
+#define DRV_VERSION	"1.0"

Looking at veth as an example, this will probably never get updated :)

So wouldn't it be better to use the kernel version as the driver
version? That way there will at least be some information in this field.
I guess we could make the same change for veth.

That's fine, I can change it to something more useful.

[...]
+static netdev_tx_t meta_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct meta *meta = netdev_priv(dev);
+	enum meta_action ret = READ_ONCE(meta->policy);
+	netdev_tx_t ret_dev = NET_XMIT_SUCCESS;
+	const struct bpf_mprog_entry *entry;
+	struct net_device *peer;
+
+	rcu_read_lock();
+	peer = rcu_dereference(meta->peer);
+	if (unlikely(!peer || !(peer->flags & IFF_UP) ||
+		     !pskb_may_pull(skb, ETH_HLEN) ||
+		     skb_orphan_frags(skb, GFP_ATOMIC)))
+		goto drop;
+	meta_scrub_minimum(skb);
+	skb->dev = peer;
+	entry = rcu_dereference(meta->active);
+	if (entry)
+		ret = meta_run(meta, entry, skb, ret);
+	switch (ret) {
+	case META_NEXT:
+	case META_PASS:
+		skb->pkt_type = PACKET_HOST;
+		skb->protocol = eth_type_trans(skb, skb->dev);
+		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
+		__netif_rx(skb);
+		break;
+	case META_REDIRECT:
+		skb_do_redirect(skb);
+		break;
+	case META_DROP:

Why the aliases for the constants? Might as well reuse the TCX names?

The constants are also used for the default configuration of the device
when no bpf is attached. Using tcx constant names as part of the config
is confusing, I don't see a reason why it needs to be tied together, it's
more confusing than it would help anything.

+	default:
+drop:
+		ret_dev = NET_XMIT_DROP;
+		dev_core_stats_tx_dropped_inc(dev);
+		kfree_skb(skb);
+		break;
+	}
+	rcu_read_unlock();
+	return ret_dev;
+}
+
+static int meta_open(struct net_device *dev)
+{
+	struct meta *meta = netdev_priv(dev);
+	struct net_device *peer = rtnl_dereference(meta->peer);
+
+	if (!peer)
+		return -ENOTCONN;
+	if (peer->flags & IFF_UP) {
+		netif_carrier_on(dev);
+		netif_carrier_on(peer);
+	}
+	return 0;
+}
+
+static int meta_close(struct net_device *dev)
+{
+	struct meta *meta = netdev_priv(dev);
+	struct net_device *peer = rtnl_dereference(meta->peer);
+
+	netif_carrier_off(dev);
+	if (peer)
+		netif_carrier_off(peer);
+	return 0;
+}
+
+static int meta_get_iflink(const struct net_device *dev)
+{
+	struct meta *meta = netdev_priv(dev);
+	struct net_device *peer;
+	int iflink = 0;
+
+	rcu_read_lock();
+	peer = rcu_dereference(meta->peer);
+	if (peer)
+		iflink = peer->ifindex;
+	rcu_read_unlock();
+	return iflink;
+}
+
+static void meta_set_multicast_list(struct net_device *dev)
+{
+}

The function name indicates there is some functionality envisioned here?
Why is the function empty?

This is a stub callback to deal with multicast filter, given it's a virtual
dev and it'll receive traffic you push to w/o further config this one is
empty. See also ndo_set_rx_mode for various other virtual-only devs. I can
add a comment.

[...]
+static struct net_device *meta_dev_fetch(struct net *net, u32 ifindex, u32 which)
+{
+	struct net_device *dev;
+	struct meta *meta;
+
+	ASSERT_RTNL();
+
+	switch (which) {
+	case BPF_META_PRIMARY:
+	case BPF_META_PEER:
+		break;
+	default:
+		return ERR_PTR(-EINVAL);
+	}
+
+	dev = __dev_get_by_index(net, ifindex);
+	if (!dev)
+		return ERR_PTR(-ENODEV);
+	if (!(dev->priv_flags & IFF_META))
+		return ERR_PTR(-ENXIO);

I don't really think a new flag value is needed here? Can't you just
make this check if (dev->netdev_ops == &meta_netdev_ops) ?

Agree, very good point. Will change.

[...]
  #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
@@ -3720,6 +3721,8 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
  		return BPF_PROG_TYPE_LSM;
  	case BPF_TCX_INGRESS:
  	case BPF_TCX_EGRESS:
+	case BPF_META_PRIMARY:
+	case BPF_META_PEER:
  		return BPF_PROG_TYPE_SCHED_CLS;
  	default:
  		return BPF_PROG_TYPE_UNSPEC;
@@ -3771,7 +3774,9 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
  		return 0;
  	case BPF_PROG_TYPE_SCHED_CLS:
  		if (attach_type != BPF_TCX_INGRESS &&
-		    attach_type != BPF_TCX_EGRESS)
+		    attach_type != BPF_TCX_EGRESS &&
+		    attach_type != BPF_META_PRIMARY &&
+		    attach_type != BPF_META_PEER)

PRIMARY and PEER basically correspond to INGRESS and EGRESS in terms of
which packets the program sees, right? So why not just reuse ingress and
egress designators, the fact that it's a "peer" attachment is mostly an
implementation detail, isn't it? Or should 'mirred' redirection to the
device inside a container also be supported? (is it?)

No, ingress/egress is higly confusing here given it can have many meanings.
You can ingress into the container or ingress into the host, for example, so
it is not clear without more context. Also in a next step we plan to make
this device configurable as a single device instead of peered. Then it's
only 'primary' available where you attach to, much simpler to reason about
from a mental model.

Reusing it (and special-casing the tcx attachment) would prevent people
from accidentally attaching a tcx program on top of the device (which
AFAICT if otherwise possible, right?). Or maybe this is a feature?

You can use tcx with it just fine and maybe some people have a need for it,
for example, for implementing logic within the container. There is certainly
no reason to prevent that.

Thanks,
Daniel




[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux