Re: [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT

Arthur Fabre <afabre@xxxxxxxxxxxxxx> · Thu, 26 Sep 2024 13:31:35 +0200

On Sun, Sep 22, 2024 at 1:12 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> FYI, we also had a discussion related to this at LPC on Friday, in this
> session: https://lpc.events/event/18/contributions/1935/
>
> The context here was that Arthur and Jakub want to also support extended
> rich metadata all the way through the SKB path, and are looking at the
> same area used for XDP metadata to store it. So there's a need to manage
> both the kernel's own usage of that area, and userspace/BPF usage of it.
>
> I'll try to summarise some of the points of that discussion (all
> interpretations are my own, of course):
>
> - We want something that can be carried with a frame all the way from
>   the XDP layer, through all SKB layers and to userspace (to replace the
>   use of skb->mark for this purpose).
>
> - We want different applications running on the system (of which the
>   kernel itself if one, cf this discussion) to be able to share this
>   field, without having to have an out of band registry (like a Github
>   repository where applications can agree on which bits to use). Which
>   probably means that the kernel needs to be in the loop somehow to
>   explicitly allocate space in the metadata area and track offsets.
>
> - Having an explicit API to access this from userspace, without having
>   to go through BPF (i.e., a socket- or CMSG-based API) would be useful.
>

Thanks for looping us in, and the great summary Toke!

> The TLV format was one of the suggestions in Arthur and Jakub's talk,
> but AFAICT, there was not a lot of enthusiasm about this in the room
> (myself included), because of the parsing overhead and complexity. I
> believe the alternative that was seen as most favourable was a map
> lookup-style API, where applications can request a metadata area of
> arbitrary size and get an ID assigned that they can then use to set/get
> values in the data path.
>
> So, sketching this out, this could be realised by something like:
>
> /* could be called from BPF, or through netlink or sysfs; may fail, if
>  * there is no more space
>  */
> int metadata_id = register_packet_metadata_field(sizeof(struct my_meta));
>
> The ID is just an opaque identifier that can then be passed to
> getter/setter functions (for both SKB and XDP), like:
>
> ret = bpf_set_packet_metadata_field(pkt, metadata_id,
>                                     &my_meta_value, sizeof(my_meta_value))
>
> ret = bpf_get_packet_metadata_field(pkt, metadata_id,
>                                     &my_meta_value, sizeof(my_meta_value))
>
>
> On the kernel side, the implementation would track registered fields in
> a global structure somewhere, say:
>
> struct pkt_metadata_entry {
>   int id;
>   u8 sz;
>   u8 offset;
>   u8 bit;
> };
>
> struct pkt_metadata_registry { /* allocated as a system-wide global */
>   u8 num_entries;
>   u8 total_size;
>   struct pkt_metadata_entry entries[MAX_ENTRIES];
> };
>
> struct xdp_rx_meta { /* at then end of xdp_frame */
>   u8 sz; /* set to pkt_metadata_registry->total_size on alloc */
>   u8 fields_set; /* bitmap of fields that have been set, see below */
>   u8 data[];
> };
>
> int register_packet_metadata_field(u8 size) {
>   struct pkt_metadata_registry *reg = get_global_registry();
>   struct pkt_metadata_entry *entry;
>
>   if (size + reg->total_size > MAX_METADATA_SIZE)
>     return -ENOSPC;
>
>   entry = &reg->entries[reg->num_entries++];
>   entry->id = assign_id();
>   entry->sz = size;
>   entry->offset = reg->total_size;
>   entry->bit = reg->num_entries - 1;
>   reg->total_size += size;
>
>   return entry->id;
> }
>
> int bpf_set_packet_metadata_field(struct xdp_frame *frm, int id, void
>                                   *value, size_t sz)
> {
>   struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);
>
>   if (!entry)
>     return -ENOENT;
>
>   if (entry->sz != sz)
>     return -EINVAL; /* user error */
>
>   if (frm->rx_meta.sz < entry->offset + sz)
>     return -EFAULT; /* entry allocated after xdp_frame was initialised */
>
>   memcpy(&frm->rx_meta.data + entry->offset, value, sz);
>   frm->rx_meta.fields_set |= BIT(entry->bit);
>
>   return 0;
> }
>
> int bpf_get_packet_metadata_field(struct xdp_frame *frm, int id, void
>                                   *value, size_t sz)
> {
>   struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);
>
>   if (!entry)
>     return -ENOENT;
>
>   if (entry->sz != sz)
>     return -EINVAL;
>
> if (frm->rx_meta.sz < entry->offset + sz)
>     return -EFAULT; /* entry allocated after xdp_frame was initialised */
>
>   if (!(frm->rx_meta.fields_set & BIT(entry->bit)))
>     return -ENOENT;
>
>   memcpy(value, &frm->rx_meta.data + entry->offset, sz);
>
>   return 0;
> }
>
> I'm hinting at some complications here (with the EFAULT return) that
> needs to be resolved: there is no guarantee that a given packet will be
> in sync with the current status of the registered metadata, so we need
> explicit checks for this. If metadata entries are de-registered again
> this also means dealing with holes and/or reshuffling the metadata
> layout to reuse the released space (incidentally, this is the one place
> where a TLV format would have advantages).
>
> The nice thing about an API like this, though, is that it's extensible,
> and the kernel itself can be just another consumer of it for the
> metadata fields Lorenzo is adding in this series. I.e., we could just
> pre-define some IDs for metadata vlan, timestamp etc, and use the same
> functions as above from within the kernel to set and get those values;
> using the registry, there could even be an option to turn those off if
> an application wants more space for its own usage. Or, alternatively, we
> could keep the kernel-internal IDs hardcoded and always allocated, and
> just use the getter/setter functions as the BPF API for accessing them.

That's exactly what I'm thinking of too, a simple API like:

get(u8 key, u8 len, void *val);
set(u8 key, u8 len, void *val);

With "well-known" keys like METADATA_ID_HW_HASH for hardware metadata.

If a NIC doesn't support a certain well-known metadata, the key
wouldn't be set, and get() would return ENOENT.

I think this also lets us avoid having to "register" keys or bits of
metadata with the kernel.
We'd reserve some number of keys for hardware metadata.

The remaining keys would be up to users. They'd have to allocate keys
to services, and configure services to use those keys.
This is similar to the way listening on a certain port works: only one
service can use port 80 or 443, and that can typically beconfigured in
a service's config file.

This side-steps the whole question of how to change the registered
metadata for in-flight packets, and how to deal with different NICs
with different hardware metadata.

I think I've figured out a suitable encoding format, hopefully we'll have an
RFC soon!

> -Toke
>