Arthur Fabre <afabre@xxxxxxxxxxxxxx> writes: > On Sun, Sep 22, 2024 at 1:12 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote: >> FYI, we also had a discussion related to this at LPC on Friday, in this >> session: https://lpc.events/event/18/contributions/1935/ >> >> The context here was that Arthur and Jakub want to also support extended >> rich metadata all the way through the SKB path, and are looking at the >> same area used for XDP metadata to store it. So there's a need to manage >> both the kernel's own usage of that area, and userspace/BPF usage of it. >> >> I'll try to summarise some of the points of that discussion (all >> interpretations are my own, of course): >> >> - We want something that can be carried with a frame all the way from >> the XDP layer, through all SKB layers and to userspace (to replace the >> use of skb->mark for this purpose). >> >> - We want different applications running on the system (of which the >> kernel itself if one, cf this discussion) to be able to share this >> field, without having to have an out of band registry (like a Github >> repository where applications can agree on which bits to use). Which >> probably means that the kernel needs to be in the loop somehow to >> explicitly allocate space in the metadata area and track offsets. >> >> - Having an explicit API to access this from userspace, without having >> to go through BPF (i.e., a socket- or CMSG-based API) would be useful. >> > > Thanks for looping us in, and the great summary Toke! You're welcome :) >> The TLV format was one of the suggestions in Arthur and Jakub's talk, >> but AFAICT, there was not a lot of enthusiasm about this in the room >> (myself included), because of the parsing overhead and complexity. I >> believe the alternative that was seen as most favourable was a map >> lookup-style API, where applications can request a metadata area of >> arbitrary size and get an ID assigned that they can then use to set/get >> values in the data path. >> >> So, sketching this out, this could be realised by something like: >> >> /* could be called from BPF, or through netlink or sysfs; may fail, if >> * there is no more space >> */ >> int metadata_id = register_packet_metadata_field(sizeof(struct my_meta)); >> >> The ID is just an opaque identifier that can then be passed to >> getter/setter functions (for both SKB and XDP), like: >> >> ret = bpf_set_packet_metadata_field(pkt, metadata_id, >> &my_meta_value, sizeof(my_meta_value)) >> >> ret = bpf_get_packet_metadata_field(pkt, metadata_id, >> &my_meta_value, sizeof(my_meta_value)) >> >> >> On the kernel side, the implementation would track registered fields in >> a global structure somewhere, say: >> >> struct pkt_metadata_entry { >> int id; >> u8 sz; >> u8 offset; >> u8 bit; >> }; >> >> struct pkt_metadata_registry { /* allocated as a system-wide global */ >> u8 num_entries; >> u8 total_size; >> struct pkt_metadata_entry entries[MAX_ENTRIES]; >> }; >> >> struct xdp_rx_meta { /* at then end of xdp_frame */ >> u8 sz; /* set to pkt_metadata_registry->total_size on alloc */ >> u8 fields_set; /* bitmap of fields that have been set, see below */ >> u8 data[]; >> }; >> >> int register_packet_metadata_field(u8 size) { >> struct pkt_metadata_registry *reg = get_global_registry(); >> struct pkt_metadata_entry *entry; >> >> if (size + reg->total_size > MAX_METADATA_SIZE) >> return -ENOSPC; >> >> entry = ®->entries[reg->num_entries++]; >> entry->id = assign_id(); >> entry->sz = size; >> entry->offset = reg->total_size; >> entry->bit = reg->num_entries - 1; >> reg->total_size += size; >> >> return entry->id; >> } >> >> int bpf_set_packet_metadata_field(struct xdp_frame *frm, int id, void >> *value, size_t sz) >> { >> struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id); >> >> if (!entry) >> return -ENOENT; >> >> if (entry->sz != sz) >> return -EINVAL; /* user error */ >> >> if (frm->rx_meta.sz < entry->offset + sz) >> return -EFAULT; /* entry allocated after xdp_frame was initialised */ >> >> memcpy(&frm->rx_meta.data + entry->offset, value, sz); >> frm->rx_meta.fields_set |= BIT(entry->bit); >> >> return 0; >> } >> >> int bpf_get_packet_metadata_field(struct xdp_frame *frm, int id, void >> *value, size_t sz) >> { >> struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id); >> >> if (!entry) >> return -ENOENT; >> >> if (entry->sz != sz) >> return -EINVAL; >> >> if (frm->rx_meta.sz < entry->offset + sz) >> return -EFAULT; /* entry allocated after xdp_frame was initialised */ >> >> if (!(frm->rx_meta.fields_set & BIT(entry->bit))) >> return -ENOENT; >> >> memcpy(value, &frm->rx_meta.data + entry->offset, sz); >> >> return 0; >> } >> >> I'm hinting at some complications here (with the EFAULT return) that >> needs to be resolved: there is no guarantee that a given packet will be >> in sync with the current status of the registered metadata, so we need >> explicit checks for this. If metadata entries are de-registered again >> this also means dealing with holes and/or reshuffling the metadata >> layout to reuse the released space (incidentally, this is the one place >> where a TLV format would have advantages). >> >> The nice thing about an API like this, though, is that it's extensible, >> and the kernel itself can be just another consumer of it for the >> metadata fields Lorenzo is adding in this series. I.e., we could just >> pre-define some IDs for metadata vlan, timestamp etc, and use the same >> functions as above from within the kernel to set and get those values; >> using the registry, there could even be an option to turn those off if >> an application wants more space for its own usage. Or, alternatively, we >> could keep the kernel-internal IDs hardcoded and always allocated, and >> just use the getter/setter functions as the BPF API for accessing them. > > That's exactly what I'm thinking of too, a simple API like: > > get(u8 key, u8 len, void *val); > set(u8 key, u8 len, void *val); > > With "well-known" keys like METADATA_ID_HW_HASH for hardware metadata. > > If a NIC doesn't support a certain well-known metadata, the key > wouldn't be set, and get() would return ENOENT. > > I think this also lets us avoid having to "register" keys or bits of > metadata with the kernel. > We'd reserve some number of keys for hardware metadata. Right, but how do you allocate space/offset for each key without an explicit allocation step? You'd basically have to encode the list of IDs in the metadata area itself, which implies a TLV format that you have to walk on every access? The registry idea in my example above was basically to avoid that... > The remaining keys would be up to users. They'd have to allocate keys > to services, and configure services to use those keys. > This is similar to the way listening on a certain port works: only one > service can use port 80 or 443, and that can typically beconfigured in > a service's config file. Right, well, port numbers *do* actually have an out of band service registry (IANA), which I thought was what we wanted to avoid? ;) > This side-steps the whole question of how to change the registered > metadata for in-flight packets, and how to deal with different NICs > with different hardware metadata. > > I think I've figured out a suitable encoding format, hopefully we'll > have an RFC soon! Alright, cool! -Toke