Re: [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT

Toke Høiland-Jørgensen <toke@xxxxxxxxxx> · Thu, 26 Sep 2024 14:41:16 +0200

Arthur Fabre <afabre@xxxxxxxxxxxxxx> writes:

> On Sun, Sep 22, 2024 at 1:12 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>> FYI, we also had a discussion related to this at LPC on Friday, in this
>> session: https://lpc.events/event/18/contributions/1935/
>>
>> The context here was that Arthur and Jakub want to also support extended
>> rich metadata all the way through the SKB path, and are looking at the
>> same area used for XDP metadata to store it. So there's a need to manage
>> both the kernel's own usage of that area, and userspace/BPF usage of it.
>>
>> I'll try to summarise some of the points of that discussion (all
>> interpretations are my own, of course):
>>
>> - We want something that can be carried with a frame all the way from
>>   the XDP layer, through all SKB layers and to userspace (to replace the
>>   use of skb->mark for this purpose).
>>
>> - We want different applications running on the system (of which the
>>   kernel itself if one, cf this discussion) to be able to share this
>>   field, without having to have an out of band registry (like a Github
>>   repository where applications can agree on which bits to use). Which
>>   probably means that the kernel needs to be in the loop somehow to
>>   explicitly allocate space in the metadata area and track offsets.
>>
>> - Having an explicit API to access this from userspace, without having
>>   to go through BPF (i.e., a socket- or CMSG-based API) would be useful.
>>
>
> Thanks for looping us in, and the great summary Toke!

You're welcome :)

>> The TLV format was one of the suggestions in Arthur and Jakub's talk,
>> but AFAICT, there was not a lot of enthusiasm about this in the room
>> (myself included), because of the parsing overhead and complexity. I
>> believe the alternative that was seen as most favourable was a map
>> lookup-style API, where applications can request a metadata area of
>> arbitrary size and get an ID assigned that they can then use to set/get
>> values in the data path.
>>
>> So, sketching this out, this could be realised by something like:
>>
>> /* could be called from BPF, or through netlink or sysfs; may fail, if
>>  * there is no more space
>>  */
>> int metadata_id = register_packet_metadata_field(sizeof(struct my_meta));
>>
>> The ID is just an opaque identifier that can then be passed to
>> getter/setter functions (for both SKB and XDP), like:
>>
>> ret = bpf_set_packet_metadata_field(pkt, metadata_id,
>>                                     &my_meta_value, sizeof(my_meta_value))
>>
>> ret = bpf_get_packet_metadata_field(pkt, metadata_id,
>>                                     &my_meta_value, sizeof(my_meta_value))
>>
>>
>> On the kernel side, the implementation would track registered fields in
>> a global structure somewhere, say:
>>
>> struct pkt_metadata_entry {
>>   int id;
>>   u8 sz;
>>   u8 offset;
>>   u8 bit;
>> };
>>
>> struct pkt_metadata_registry { /* allocated as a system-wide global */
>>   u8 num_entries;
>>   u8 total_size;
>>   struct pkt_metadata_entry entries[MAX_ENTRIES];
>> };
>>
>> struct xdp_rx_meta { /* at then end of xdp_frame */
>>   u8 sz; /* set to pkt_metadata_registry->total_size on alloc */
>>   u8 fields_set; /* bitmap of fields that have been set, see below */
>>   u8 data[];
>> };
>>
>> int register_packet_metadata_field(u8 size) {
>>   struct pkt_metadata_registry *reg = get_global_registry();
>>   struct pkt_metadata_entry *entry;
>>
>>   if (size + reg->total_size > MAX_METADATA_SIZE)
>>     return -ENOSPC;
>>
>>   entry = &reg->entries[reg->num_entries++];
>>   entry->id = assign_id();
>>   entry->sz = size;
>>   entry->offset = reg->total_size;
>>   entry->bit = reg->num_entries - 1;
>>   reg->total_size += size;
>>
>>   return entry->id;
>> }
>>
>> int bpf_set_packet_metadata_field(struct xdp_frame *frm, int id, void
>>                                   *value, size_t sz)
>> {
>>   struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);
>>
>>   if (!entry)
>>     return -ENOENT;
>>
>>   if (entry->sz != sz)
>>     return -EINVAL; /* user error */
>>
>>   if (frm->rx_meta.sz < entry->offset + sz)
>>     return -EFAULT; /* entry allocated after xdp_frame was initialised */
>>
>>   memcpy(&frm->rx_meta.data + entry->offset, value, sz);
>>   frm->rx_meta.fields_set |= BIT(entry->bit);
>>
>>   return 0;
>> }
>>
>> int bpf_get_packet_metadata_field(struct xdp_frame *frm, int id, void
>>                                   *value, size_t sz)
>> {
>>   struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);
>>
>>   if (!entry)
>>     return -ENOENT;
>>
>>   if (entry->sz != sz)
>>     return -EINVAL;
>>
>> if (frm->rx_meta.sz < entry->offset + sz)
>>     return -EFAULT; /* entry allocated after xdp_frame was initialised */
>>
>>   if (!(frm->rx_meta.fields_set & BIT(entry->bit)))
>>     return -ENOENT;
>>
>>   memcpy(value, &frm->rx_meta.data + entry->offset, sz);
>>
>>   return 0;
>> }
>>
>> I'm hinting at some complications here (with the EFAULT return) that
>> needs to be resolved: there is no guarantee that a given packet will be
>> in sync with the current status of the registered metadata, so we need
>> explicit checks for this. If metadata entries are de-registered again
>> this also means dealing with holes and/or reshuffling the metadata
>> layout to reuse the released space (incidentally, this is the one place
>> where a TLV format would have advantages).
>>
>> The nice thing about an API like this, though, is that it's extensible,
>> and the kernel itself can be just another consumer of it for the
>> metadata fields Lorenzo is adding in this series. I.e., we could just
>> pre-define some IDs for metadata vlan, timestamp etc, and use the same
>> functions as above from within the kernel to set and get those values;
>> using the registry, there could even be an option to turn those off if
>> an application wants more space for its own usage. Or, alternatively, we
>> could keep the kernel-internal IDs hardcoded and always allocated, and
>> just use the getter/setter functions as the BPF API for accessing them.
>
> That's exactly what I'm thinking of too, a simple API like:
>
> get(u8 key, u8 len, void *val);
> set(u8 key, u8 len, void *val);
>
> With "well-known" keys like METADATA_ID_HW_HASH for hardware metadata.
>
> If a NIC doesn't support a certain well-known metadata, the key
> wouldn't be set, and get() would return ENOENT.
>
> I think this also lets us avoid having to "register" keys or bits of
> metadata with the kernel.
> We'd reserve some number of keys for hardware metadata.

Right, but how do you allocate space/offset for each key without an
explicit allocation step? You'd basically have to encode the list of IDs
in the metadata area itself, which implies a TLV format that you have to
walk on every access? The registry idea in my example above was
basically to avoid that...

> The remaining keys would be up to users. They'd have to allocate keys
> to services, and configure services to use those keys.
> This is similar to the way listening on a certain port works: only one
> service can use port 80 or 443, and that can typically beconfigured in
> a service's config file.

Right, well, port numbers *do* actually have an out of band service
registry (IANA), which I thought was what we wanted to avoid? ;)

> This side-steps the whole question of how to change the registered
> metadata for in-flight packets, and how to deal with different NICs
> with different hardware metadata.
>
> I think I've figured out a suitable encoding format, hopefully we'll
> have an RFC soon!

Alright, cool!

-Toke