Daniel Borkmann <daniel@xxxxxxxxxxxxx> writes:

> This work adds a new, minimal BPF-programmable device called "netkit"
> (former PoC code-name "meta") we recently presented at LSF/MM/BPF. The
> core idea is that BPF programs are executed within the drivers xmit routine
> and therefore e.g. in case of containers/Pods moving BPF processing closer
> to the source.
>
> One of the goals was that in case of Pod egress traffic, this allows to
> move BPF programs from hostns tcx ingress into the device itself, providing
> earlier drop or forward mechanisms, for example, if the BPF program
> determines that the skb must be sent out of the node, then a redirect to
> the physical device can take place directly without going through per-CPU
> backlog queue. This helps to shift processing for such traffic from softirq
> to process context, leading to better scheduling decisions/performance (see
> measurements in the slides).
>
> In this initial version, the netkit device ships as a pair, but we plan to
> extend this further so it can also operate in single device mode. The pair
> comes with a primary and a peer device. Only the primary device, typically
> residing in hostns, can manage BPF programs for itself and its peer. The
> peer device is designated for containers/Pods and cannot attach/detach
> BPF programs. Upon the device creation, the user can set the default policy
> to 'forward' or 'drop' for the case when no BPF program is attached.

Nit: according to the code the policies are 'pass' and 'drop'? :)

> Additionally, the device can be operated in L3 (default) or L2 mode. The
> management of BPF programs is done via bpf_mprog, so that multi-attach is
> supported right from the beginning with similar API and dependency controls
> as tcx. For details on the latter see commit 053c8e1f235d ("bpf: Add generic
> attach/detach/query API for multi-progs"). tc BPF compatibility is provided,
> so that existing programs can be easily migrated.
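
For anyone following along, here is a rough user-space model of how a default policy and a multi-prog chain could interact. It is only a sketch, not the driver code: the names (verdict, prog_fn, run_chain, prog_defer, prog_drop) are made up for illustration, the numeric verdict values are an assumption, and treating a chain where every program defers as 'pass' is likewise an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Verdicts modeled loosely on the tcx/netkit action codes. The exact
 * numbers are an assumption of this sketch. */
enum verdict {
	VERDICT_NEXT = -1,	/* defer to the next program in the chain */
	VERDICT_PASS = 0,	/* let the packet through to the peer */
	VERDICT_DROP = 2,	/* discard the packet */
};

/* Hypothetical stand-in for an attached BPF program. */
typedef enum verdict (*prog_fn)(const void *pkt);

/* Example program that defers the decision. */
static enum verdict prog_defer(const void *pkt)
{
	(void)pkt;
	return VERDICT_NEXT;
}

/* Example program that always drops. */
static enum verdict prog_drop(const void *pkt)
{
	(void)pkt;
	return VERDICT_DROP;
}

/*
 * Run the attached programs in order; the first verdict other than
 * VERDICT_NEXT ends the chain. With no program attached, the default
 * policy chosen at device creation decides.
 */
static enum verdict run_chain(const prog_fn *progs, size_t n,
			      const void *pkt, enum verdict policy)
{
	enum verdict ret = policy;
	size_t i;

	assert(progs != NULL || n == 0);

	for (i = 0; i < n; i++) {
		ret = progs[i](pkt);
		if (ret != VERDICT_NEXT)
			break;
	}

	/* Assumption: if every program deferred, treat it as pass. */
	return ret == VERDICT_NEXT ? VERDICT_PASS : ret;
}
```

With an empty chain, run_chain() simply returns the policy it was given, which is the "no BPF program is attached" case from the commit message. With { prog_defer, prog_drop } attached, the second program's drop verdict wins regardless of the policy.
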
>
> Going forward, we plan to use netkit devices in Cilium as the main device
> type for connecting Pods. They will be operated in L3 mode in order to
> simplify a Pod's neighbor management and the peer will operate in default
> drop mode, so that no traffic is leaving between the time when a Pod is
> brought up by the CNI plugin and programs attached by the agent.
> Additionally, the programs we attach via tcx on the physical devices are
> using bpf_redirect_peer() for inbound traffic into netkit device, hence the
> latter is also supporting the ndo_get_peer_dev callback. Similarly, we use
> bpf_redirect_neigh() for the way out, pushing from netkit peer to phys device
> directly. Also, BIG TCP is supported on netkit device. For the follow-up
> work in single device mode, we plan to convert Cilium's cilium_host/_net
> devices into a single one.
>
> An extensive test suite for checking device operations and the BPF program
> and link management API comes as BPF selftests in this series.
>
> Co-developed-by: Nikolay Aleksandrov <razor@xxxxxxxxxxxxx>
> Signed-off-by: Nikolay Aleksandrov <razor@xxxxxxxxxxxxx>
> Signed-off-by: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
> Link: https://github.com/borkmann/iproute2/tree/pr/netkit
> Link: http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf (24ff.)

I like the new name - thank you for changing it! :)

Reviewed-by: Toke Høiland-Jørgensen <toke@xxxxxxxxxx>