On Sep 27, Arthur Fabre wrote: > On Fri Sep 27, 2024 at 12:24 PM CEST, Toke Høiland-Jørgensen wrote: > > "Arthur Fabre" <afabre@xxxxxxxxxxxxxx> writes: > > > > >> >> The nice thing about an API like this, though, is that it's extensible, > > >> >> and the kernel itself can be just another consumer of it for the > > >> >> metadata fields Lorenzo is adding in this series. I.e., we could just > > >> >> pre-define some IDs for metadata vlan, timestamp etc, and use the same > > >> >> functions as above from within the kernel to set and get those values; > > >> >> using the registry, there could even be an option to turn those off if > > >> >> an application wants more space for its own usage. Or, alternatively, we > > >> >> could keep the kernel-internal IDs hardcoded and always allocated, and > > >> >> just use the getter/setter functions as the BPF API for accessing them. > > >> > > > >> > That's exactly what I'm thinking of too, a simple API like: > > >> > > > >> > get(u8 key, u8 len, void *val); > > >> > set(u8 key, u8 len, void *val); > > >> > > > >> > With "well-known" keys like METADATA_ID_HW_HASH for hardware metadata. > > >> > > > >> > If a NIC doesn't support a certain well-known metadata, the key > > >> > wouldn't be set, and get() would return ENOENT. > > >> > > > >> > I think this also lets us avoid having to "register" keys or bits of > > >> > metadata with the kernel. > > >> > We'd reserve some number of keys for hardware metadata. > > >> > > >> Right, but how do you allocate space/offset for each key without an > > >> explicit allocation step? You'd basically have to encode the list of IDs > > >> in the metadata area itself, which implies a TLV format that you have to > > >> walk on every access? The registry idea in my example above was > > >> basically to avoid that... > > > > > > I've been playing around with having a small fixed header at the front > > > of the metadata itself, that lets you access values without walking them > > > all. > > > > > > Still WIP, and maybe this is too restrictive, but it lets you encode 64 > > > 2, 4, or 8 byte KVs with a single 16 byte header: > > > > Ohh, that's clever, I like it! :) > > > > It's also extensible in the sense that the internal representation can > > change without impacting the API, so if we end up needing more bits we > > can always add those. > > > > Maybe it would be a good idea to make the 'key' parameter a larger > > integer type (function parameters are always 64-bit anyway, so might as > > well go all the way up to u64)? That way we can use higher values for > > the kernel-reserved types instead of reserving part of the already-small > > key space for applications (assuming the kernel-internal data is stored > > somewhere else, like in a static struct as in Lorenzo's patch)? > > Good idea! That makes it more extensible too if we ever support more > keys or bigger lengths. > > I'm not sure where the kernel-reserved types should live. Putting them > in here uses up some the of KV IDs, but it uses less head room space than > always reserving a static struct for them. > Maybe it doesn't matter much, as long as we use the same API to access > them, we could internally switch between one and the other. > > > > > [...] > > > > >> > The remaining keys would be up to users. They'd have to allocate keys > > >> > to services, and configure services to use those keys. > > >> > This is similar to the way listening on a certain port works: only one > > >> > service can use port 80 or 443, and that can typically beconfigured in > > >> > a service's config file. > > >> > > >> Right, well, port numbers *do* actually have an out of band service > > >> registry (IANA), which I thought was what we wanted to avoid? ;) > > > > > > Depends how you think about it ;) > > > > > > I think we should avoid a global registry. But having a registry per > > > deployment / server doesn't seem awful. Services that want to use a > > > field would have a config file setting to set which index it corresponds > > > to. > > > Admins would just have to pick a free one on their system, and set it in > > > the config file of the service. > > > > > > This is similar to what we do for non-IANA registered ports internally. > > > For example each service needs a port on an internal interface to expose > > > metrics, and we just track which ports are taken in our config > > > management. > > > > Right, this would work, but it assumes that applications are > > well-behaved and do this correctly. Which they probably do in a > > centrally-managed system like yours, but for random applications shipped > > by distros, I'm not sure if it's going to work. > > > > In fact, it's more or less the situation we have with skb->mark today, > > isn't it? I.e., applications can co-exist as long as they don't use the > > same bits, so they have to coordinate on which bits to use. Sure, with > > this scheme there will be more total bits to use, which can lessen the > > pressure somewhat, but the basic problem remains. In other words, I > > worry that in practice we will end up with another github repository > > serving as a registry for metadata keys... > > That's true. If applications hardcode keys, we'll be back to having > conflicts. If someone creates a registry on github I'll be very sad. > > (Maybe we can make the verifier enforce that the key passed to get() and > set() isn't a constant? - only joking) > > Applications don't tend to do this for ports though, I think most can be > configured to listen on any port. Is that just because it's been > instilled as "good practice" over time? Could we try to do the same with > some very stern documentation and examples? > > Thinking about it more, my only relectance for a registration API is how > to communicate the ID back to other consumers (our discussion below). > > > > > > Dynamically registering fields means you have to share the returned ID > > > with whoever is interested, which sounds tricky. > > > If an XDP program sets a field like packet_id, every tracing > > > program that looks at it, and userspace service, would need to know what > > > the ID of that field is. > > > Is there a way to easily share that ID with all of them? > > > > Right, so I'll admit this was one of the handwavy bits of my original > > proposal, but I don't think it's unsolvable. You could do something like > > (once, on application initialisation): > > > > __u64 my_key = bpf_register_metadata_field(my_size); /* maybe add a name for introspection? */ > > bpf_map_update(&shared_application_config, &my_key_index, &my_key); > > > > and then just get the key out of that map from all programs that want to > > use it? > > Passing it out of band works (whether it's via a pinned map like you > described, or through other means like a unix socket or some other > API), it's just more complicated. > > Every consumer also needs to know about that API. That won't work with > standard tools. For example if we set a PACKET_ID KV, maybe we could > give it to pwru so it could track packets using it? > Without registering keys, we could pass it as a cli flag. With > registration, we'd have to have some helper to get the KV ID. > > It also introduces ordering dependencies between the services on > startup, eg packet tracing hooks could only be attached once our XDP > service has registered a PACKET_ID KV, and they could query it's API. > > > > > We could combine such a registration API with your header format, so > > that the registration just becomes a way of allocating one of the keys > > from 0-63 (and the registry just becomes a global copy of the header). > > This would basically amount to moving the "service config file" into the > > kernel, since that seems to be the only common denominator we can rely > > on between BPF applications (as all attempts to write a common daemon > > for BPF management have shown). > > That sounds reasonable. And I guess we'd have set() check the global > registry to enforce that the key has been registered beforehand? > > > > > -Toke > > Thanks for all the feedback! I like this 'fast' KV approach but I guess we should really evaluate its impact on performances (especially for xdp) since, based on the kfunc calls order in the ebpf program, we can have one or multiple memmove/memcpy for each packet, right? Moreover, I still think the metadata area in the xdp_frame/xdp_buff is not so suitable for nic hw metadata since: - it grows backward - it is probably in a different cacheline with respect to xdp_frame - nic hw metadata will not start at fixed and immutable address, but it depends on the running ebpf program What about having something like: - fixed hw nic metadata: just after xdp_frame struct (or if you want at the end of the metadata area :)). Here he can reuse the same KV approach if it is fast - user defined metadata: in the metadata area of the xdp_frame/xdp_buff Regards, Lorenzo
Attachment:
signature.asc
Description: PGP signature