Tue, Oct 24, 2023 at 11:48:58PM CEST, daniel@xxxxxxxxxxxxx wrote: >This work adds a new, minimal BPF-programmable device called "netkit" >(former PoC code-name "meta") we recently presented at LSF/MM/BPF. The >core idea is that BPF programs are executed within the drivers xmit routine >and therefore e.g. in case of containers/Pods moving BPF processing closer >to the source. > >One of the goals was that in case of Pod egress traffic, this allows to >move BPF programs from hostns tcx ingress into the device itself, providing >earlier drop or forward mechanisms, for example, if the BPF program >determines that the skb must be sent out of the node, then a redirect to >the physical device can take place directly without going through per-CPU >backlog queue. This helps to shift processing for such traffic from softirq >to process context, leading to better scheduling decisions/performance (see >measurements in the slides). > >In this initial version, the netkit device ships as a pair, but we plan to >extend this further so it can also operate in single device mode. The pair Single device mode should work how? >comes with a primary and a peer device. Only the primary device, typically >residing in hostns, can manage BPF programs for itself and its peer. The >peer device is designated for containers/Pods and cannot attach/detach >BPF programs. Upon the device creation, the user can set the default policy >to 'pass' or 'drop' for the case when no BPF program is attached. It looks to me that you only need the host (primary) netdevice to be used as a handle to attach the bpf programs. Because the bpf program can (and probably in real use case will) redirect to uplink/another pod netdevice skipping the host (primary) netdevice, correct? If so, why can't you do just single device mode from start finding a different sort of bpf attach handle? (not sure which) > >Additionally, the device can be operated in L3 (default) or L2 mode. The >management of BPF programs is done via bpf_mprog, so that multi-attach is >supported right from the beginning with similar API and dependency controls >as tcx. For details on the latter see commit 053c8e1f235d ("bpf: Add generic >attach/detach/query API for multi-progs"). tc BPF compatibility is provided, >so that existing programs can be easily migrated. > >Going forward, we plan to use netkit devices in Cilium as the main device >type for connecting Pods. They will be operated in L3 mode in order to >simplify a Pod's neighbor management and the peer will operate in default >drop mode, so that no traffic is leaving between the time when a Pod is >brought up by the CNI plugin and programs attached by the agent. >Additionally, the programs we attach via tcx on the physical devices are >using bpf_redirect_peer() for inbound traffic into netkit device, hence the >latter is also supporting the ndo_get_peer_dev callback. Similarly, we use >bpf_redirect_neigh() for the way out, pushing from netkit peer to phys device >directly. Also, BIG TCP is supported on netkit device. For the follow-up >work in single device mode, we plan to convert Cilium's cilium_host/_net >devices into a single one. > >An extensive test suite for checking device operations and the BPF program >and link management API comes as BPF selftests in this series. > Couple of nitpicks below: [..] >+static int netkit_check_policy(int policy, struct nlattr *tb, >+ struct netlink_ext_ack *extack) >+{ >+ switch (policy) { >+ case NETKIT_PASS: >+ case NETKIT_DROP: >+ return 0; >+ default: Isn't this job for netlink policy? >+ NL_SET_ERR_MSG_ATTR(extack, tb, >+ "Provided default xmit policy not supported"); >+ return -EINVAL; >+ } >+} >+ >+static int netkit_check_mode(int mode, struct nlattr *tb, >+ struct netlink_ext_ack *extack) >+{ >+ switch (mode) { >+ case NETKIT_L2: >+ case NETKIT_L3: >+ return 0; >+ default: Isn't this job for netlink policy? >+ NL_SET_ERR_MSG_ATTR(extack, tb, >+ "Provided device mode can only be L2 or L3"); >+ return -EINVAL; >+ } >+} >+ >+static int netkit_validate(struct nlattr *tb[], struct nlattr *data[], >+ struct netlink_ext_ack *extack) >+{ >+ struct nlattr *attr = tb[IFLA_ADDRESS]; >+ >+ if (!attr) >+ return 0; >+ NL_SET_ERR_MSG_ATTR(extack, attr, >+ "Setting Ethernet address is not supported"); >+ return -EOPNOTSUPP; >+} >+ >+static struct rtnl_link_ops netkit_link_ops; >+ >+static int netkit_new_link(struct net *src_net, struct net_device *dev, >+ struct nlattr *tb[], struct nlattr *data[], >+ struct netlink_ext_ack *extack) >+{ >+ struct nlattr *peer_tb[IFLA_MAX + 1], **tbp = tb, *attr; >+ enum netkit_action default_prim = NETKIT_PASS; >+ enum netkit_action default_peer = NETKIT_PASS; >+ enum netkit_mode mode = NETKIT_L3; >+ unsigned char ifname_assign_type; >+ struct ifinfomsg *ifmp = NULL; >+ struct net_device *peer; >+ char ifname[IFNAMSIZ]; >+ struct netkit *nk; >+ struct net *net; >+ int err; >+ >+ if (data) { >+ if (data[IFLA_NETKIT_MODE]) { >+ attr = data[IFLA_NETKIT_MODE]; >+ mode = nla_get_u32(attr); >+ err = netkit_check_mode(mode, attr, extack); >+ if (err < 0) >+ return err; >+ } >+ if (data[IFLA_NETKIT_PEER_INFO]) { >+ attr = data[IFLA_NETKIT_PEER_INFO]; >+ ifmp = nla_data(attr); >+ err = rtnl_nla_parse_ifinfomsg(peer_tb, attr, extack); >+ if (err < 0) >+ return err; >+ err = netkit_validate(peer_tb, NULL, extack); >+ if (err < 0) >+ return err; >+ tbp = peer_tb; >+ } >+ if (data[IFLA_NETKIT_POLICY]) { >+ attr = data[IFLA_NETKIT_POLICY]; >+ default_prim = nla_get_u32(attr); >+ err = netkit_check_policy(default_prim, attr, extack); >+ if (err < 0) >+ return err; >+ } >+ if (data[IFLA_NETKIT_PEER_POLICY]) { >+ attr = data[IFLA_NETKIT_PEER_POLICY]; >+ default_peer = nla_get_u32(attr); >+ err = netkit_check_policy(default_peer, attr, extack); >+ if (err < 0) >+ return err; >+ } >+ } >+ >+ if (ifmp && tbp[IFLA_IFNAME]) { >+ nla_strscpy(ifname, tbp[IFLA_IFNAME], IFNAMSIZ); >+ ifname_assign_type = NET_NAME_USER; >+ } else { >+ strscpy(ifname, "nk%d", IFNAMSIZ); >+ ifname_assign_type = NET_NAME_ENUM; >+ } >+ >+ net = rtnl_link_get_net(src_net, tbp); >+ if (IS_ERR(net)) >+ return PTR_ERR(net); >+ >+ peer = rtnl_create_link(net, ifname, ifname_assign_type, >+ &netkit_link_ops, tbp, extack); >+ if (IS_ERR(peer)) { >+ put_net(net); >+ return PTR_ERR(peer); >+ } >+ >+ netif_inherit_tso_max(peer, dev); >+ >+ if (mode == NETKIT_L2) >+ eth_hw_addr_random(peer); >+ if (ifmp && dev->ifindex) >+ peer->ifindex = ifmp->ifi_index; >+ >+ nk = netkit_priv(peer); >+ nk->primary = false; >+ nk->policy = default_peer; >+ nk->mode = mode; >+ bpf_mprog_bundle_init(&nk->bundle); >+ RCU_INIT_POINTER(nk->active, NULL); >+ RCU_INIT_POINTER(nk->peer, NULL); Aren't these already 0? >+ >+ err = register_netdevice(peer); >+ put_net(net); >+ if (err < 0) >+ goto err_register_peer; >+ netif_carrier_off(peer); >+ if (mode == NETKIT_L2) >+ dev_change_flags(peer, peer->flags & ~IFF_NOARP, NULL); >+ >+ err = rtnl_configure_link(peer, NULL, 0, NULL); >+ if (err < 0) >+ goto err_configure_peer; >+ >+ if (mode == NETKIT_L2) >+ eth_hw_addr_random(dev); >+ if (tb[IFLA_IFNAME]) >+ nla_strscpy(dev->name, tb[IFLA_IFNAME], IFNAMSIZ); >+ else >+ strscpy(dev->name, "nk%d", IFNAMSIZ); >+ >+ nk = netkit_priv(dev); >+ nk->primary = true; >+ nk->policy = default_prim; >+ nk->mode = mode; >+ bpf_mprog_bundle_init(&nk->bundle); >+ RCU_INIT_POINTER(nk->active, NULL); >+ RCU_INIT_POINTER(nk->peer, NULL); >+ >+ err = register_netdevice(dev); >+ if (err < 0) >+ goto err_configure_peer; >+ netif_carrier_off(dev); >+ if (mode == NETKIT_L2) >+ dev_change_flags(dev, dev->flags & ~IFF_NOARP, NULL); >+ >+ rcu_assign_pointer(netkit_priv(dev)->peer, peer); >+ rcu_assign_pointer(netkit_priv(peer)->peer, dev); >+ return 0; >+err_configure_peer: >+ unregister_netdevice(peer); >+ return err; >+err_register_peer: >+ free_netdev(peer); >+ return err; >+} >+ [..]