Currently, the only way to attach information to an sk_buff that travels
through the network stack is the mark field. This 32-bit field is highly
versatile: it can be read in firewall rules, drive routing decisions, and
be accessed by BPF programs.

However, its limited capacity creates competition for bits, restricting
its practical use.

To remedy this, we propose using part of the packet headroom to store
metadata. This would allow:

- Tracing packets through the network stack and across the kernel-user
  space boundary, by assigning them a unique ID.
- Metadata-driven packet redirection, routing, and socket steering with
  early classification in XDP.
- Extracting information from encapsulation headers and sharing it with
  user space or vice versa.
- Exposing XDP RX Metadata, like the timestamp, to the rest of the
  network stack.

We originally proposed extending XDP metadata (binary blob storage, also
in the headroom) to expose it throughout the network stack. However,
based on feedback at LPC 2024 [1]:

- Sharing a binary blob amongst different applications is hard.
- Exposing a binary blob to userspace is awkward.

we've shifted to a limited KV store in the headroom.

To differentiate this from the overloaded "metadata" term, it's
tentatively called "packet traits".

A get() / set() / delete() API is exposed to BPF to store and retrieve
traits.

Initial benchmarks in XDP are promising, with the cost of get() / set()
comparable to that of an indirect function call. See patch 6:
"trait: Replace memmove calls with inline move" for full results.

We imagine adding first-class support for this in netfilter (setting /
checking traits in rules) and routing (selecting routing tables based on
traits) in follow-up work.

We also envisage a first-class userspace API for storing and retrieving
traits in the future.
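
As a rough illustration of the get() / set() / delete() semantics
described above, here is a minimal userspace sketch in C. It is not the
code from this series: the names trait_store, trait_set, trait_get and
trait_del, the 64-key limit, the fixed 8-byte values and the 128-byte
value area are all assumptions made for the example. Values are kept
packed in ascending key order, so inserting or deleting a key shifts the
values that follow it.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define TRAIT_MAX_KEYS  64   /* hypothetical: one presence bit per key */
#define TRAIT_VAL_SIZE  8    /* hypothetical: every value is 8 bytes */
#define TRAIT_AREA_SIZE 128  /* hypothetical: bytes reserved for values */

struct trait_store {
	uint64_t present;               /* bit k set => key k has a value */
	uint8_t vals[TRAIT_AREA_SIZE];  /* values packed in ascending key order */
};

/* Count the set bits in `bits`. */
static unsigned int trait_count(uint64_t bits)
{
	unsigned int n = 0;

	for (; bits; bits &= bits - 1)
		n++;
	return n;
}

/* Number of stored values whose key is lower than `key`. */
static unsigned int trait_index(const struct trait_store *t, unsigned int key)
{
	assert(key < TRAIT_MAX_KEYS);
	return trait_count(t->present & ((UINT64_C(1) << key) - 1));
}

/* Store `val` under `key`, overwriting any existing value. */
int trait_set(struct trait_store *t, unsigned int key, uint64_t val)
{
	unsigned int idx, used;
	uint8_t *slot;

	if (key >= TRAIT_MAX_KEYS)
		return -EINVAL;
	idx = trait_index(t, key);
	slot = t->vals + idx * TRAIT_VAL_SIZE;
	if (!(t->present & (UINT64_C(1) << key))) {
		used = trait_count(t->present);
		if ((used + 1) * TRAIT_VAL_SIZE > TRAIT_AREA_SIZE)
			return -ENOSPC;
		/* Shift higher-keyed values up to open a slot. */
		memmove(slot + TRAIT_VAL_SIZE, slot,
			(used - idx) * TRAIT_VAL_SIZE);
		t->present |= UINT64_C(1) << key;
	}
	memcpy(slot, &val, TRAIT_VAL_SIZE);
	return 0;
}

/* Copy the value stored under `key` into `*val`. */
int trait_get(const struct trait_store *t, unsigned int key, uint64_t *val)
{
	if (key >= TRAIT_MAX_KEYS)
		return -EINVAL;
	if (!(t->present & (UINT64_C(1) << key)))
		return -ENOENT;
	memcpy(val, t->vals + trait_index(t, key) * TRAIT_VAL_SIZE,
	       TRAIT_VAL_SIZE);
	return 0;
}

/* Remove `key` and close the gap left by its value. */
int trait_del(struct trait_store *t, unsigned int key)
{
	unsigned int idx, used;
	uint8_t *slot;

	if (key >= TRAIT_MAX_KEYS)
		return -EINVAL;
	if (!(t->present & (UINT64_C(1) << key)))
		return -ENOENT;
	idx = trait_index(t, key);
	used = trait_count(t->present);
	slot = t->vals + idx * TRAIT_VAL_SIZE;
	memmove(slot, slot + TRAIT_VAL_SIZE,
		(used - idx - 1) * TRAIT_VAL_SIZE);
	t->present &= ~(UINT64_C(1) << key);
	return 0;
}
```

In this sketch a lookup is a bitmap test plus a popcount, while set and
delete pay for a memmove of the values stored after the key. A
zero-initialised struct trait_store is an empty store.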

To co-exist with the existing XDP metadata area, traits are stored at the
start of the headroom:

| xdp_frame | traits | headroom | XDP metadata | data / packet |

Traits and XDP metadata are not allowed to overlap.

Like XDP metadata, this relies on there being sufficient headroom
available. Piggybacking on top of that work, traits are currently only
supported:

- On ingress.
- By NIC drivers that support XDP metadata.
- When an XDP program is attached.

This limits the applicability of traits, but future work guaranteeing
sufficient headroom through other means should allow these restrictions
to be lifted.

There are still a number of open questions:

- What sizes of values should be allowed? See patch 1
  "trait: limited KV store for packet metadata".
- How should we handle skb clones? See patch 16
  "trait: Support sk_buffs".
- How should trait keys be allocated? See patch 18
  "trait: registration API".
- How should traits work with GRO? Could an API let us specify policies
  for how traits should be merged? See patch 18
  "trait: registration API".
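
The headroom layout described above implies a simple no-overlap rule:
traits grow up from just after the xdp_frame, XDP metadata grows down
from the packet data, and neither may cross into the other. The sketch
below models that bookkeeping in plain userspace C. The struct
headroom_layout type, its fields and the helper names are hypothetical
and only illustrate the arithmetic; they are not taken from this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one packet's headroom, sizes in bytes. */
struct headroom_layout {
	size_t headroom;     /* total bytes in front of the packet data */
	size_t frame_size;   /* stand-in for the space taken by xdp_frame */
	size_t traits_size;  /* bytes used by traits, after xdp_frame */
	size_t meta_size;    /* bytes used by XDP metadata, before data */
};

/* Free bytes between the end of traits and the start of XDP metadata. */
size_t headroom_free(const struct headroom_layout *l)
{
	size_t used = l->frame_size + l->traits_size + l->meta_size;

	/* Invariant: the three regions never exceed the headroom. */
	assert(used <= l->headroom);
	return l->headroom - used;
}

/* Grow the trait area by `extra` bytes unless it would hit XDP metadata. */
bool layout_grow_traits(struct headroom_layout *l, size_t extra)
{
	if (extra > headroom_free(l))
		return false;
	l->traits_size += extra;
	return true;
}

/* Grow the XDP metadata area by `extra` bytes unless it would hit traits. */
bool layout_grow_meta(struct headroom_layout *l, size_t extra)
{
	if (extra > headroom_free(l))
		return false;
	l->meta_size += extra;
	return true;
}
```

Both areas draw from the same pool of free headroom, so whichever is
grown first reduces what is left for the other.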

[1] https://lpc.events/event/18/contributions/1935/

Cc: jakub@xxxxxxxxxxxxxx
Cc: hawk@xxxxxxxxxx
Cc: yan@xxxxxxxxxxxxxx
Cc: jbrandeburg@xxxxxxxxxxxxxx
Cc: thoiland@xxxxxxxxxx
Cc: lbiancon@xxxxxxxxxx
To: netdev@xxxxxxxxxxxxxxx
To: bpf@xxxxxxxxxxxxxxx

Signed-off-by: Arthur Fabre <afabre@xxxxxxxxxxxxxx>
---
Arthur Fabre (19):
      trait: limited KV store for packet metadata
      trait: XDP support
      trait: basic XDP selftest
      trait: basic XDP benchmark
      trait: Replace memcpy calls with inline copies
      trait: Replace memmove calls with inline move
      xdp: Track if metadata is supported in xdp_frame <> xdp_buff conversions
      trait: Propagate presence of traits to sk_buff
      bnxt: Propagate trait presence to skb
      ice: Propagate trait presence to skb
      veth: Propagate trait presence to skb
      virtio_net: Propagate trait presence to skb
      mlx5: Propagate trait presence to skb
      xdp generic: Propagate trait presence to skb
      trait: Support sk_buffs
      trait: Allow socket filters to access traits
      trait: registration API
      trait: Sync linux/bpf.h to tools/ for trait registration
      trait: register traits in benchmarks and tests

Jesper Dangaard Brouer (1):
      mlx5: move xdp_buff scope one level up

 drivers/net/ethernet/broadcom/bnxt/bnxt.c          |   4 +
 drivers/net/ethernet/intel/ice/ice_txrx.c          |   4 +
 drivers/net/ethernet/intel/ice/ice_xsk.c           |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |   6 +-
 .../net/ethernet/mellanox/mlx5/core/en/xsk/rx.c    |   6 +-
 .../net/ethernet/mellanox/mlx5/core/en/xsk/rx.h    |   6 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    | 114 ++++----
 drivers/net/veth.c                                 |   4 +
 drivers/net/virtio_net.c                           |   8 +-
 include/linux/bpf-netns.h                          |  12 +
 include/linux/skbuff.h                             |  33 ++-
 include/net/net_namespace.h                        |   6 +
 include/net/netns/trait.h                          |  22 ++
 include/net/trait.h                                | 288 +++++++++++++++++++++
 include/net/xdp.h                                  |  42 ++-
 include/uapi/linux/bpf.h                           |  26 ++
 kernel/bpf/net_namespace.c                         |  54 ++++
 kernel/bpf/syscall.c                               |  26 ++
 kernel/bpf/verifier.c                              |  39 ++-
 net/core/dev.c                                     |   1 +
 net/core/filter.c                                  |  43 ++-
 net/core/skbuff.c                                  |  25 +-
 net/core/xdp.c                                     |  50 ++++
 tools/include/uapi/linux/bpf.h                     |  26 ++
 tools/testing/selftests/bpf/Makefile               |   2 +
 tools/testing/selftests/bpf/bench.c                |  11 +
 tools/testing/selftests/bpf/bench.h                |   1 +
 .../selftests/bpf/benchs/bench_xdp_traits.c        | 191 ++++++++++++++
 .../testing/selftests/bpf/prog_tests/xdp_traits.c  |  51 ++++
 .../testing/selftests/bpf/progs/bench_xdp_traits.c | 131 ++++++++++
 .../testing/selftests/bpf/progs/test_xdp_traits.c  |  94 +++++++
 31 files changed, 1259 insertions(+), 69 deletions(-)
---
base-commit: 42ba8a49d085e0c2ad50fb9a8ec954c9762b6e01
change-id: 20250305-afabre-traits-010-rfc2-a8e4de0c490b

Best regards,
--
Arthur Fabre <afabre@xxxxxxxxxxxxxx>