Toke Høiland-Jørgensen wrote: > John Fastabend <john.fastabend@xxxxxxxxx> writes: > > > Jesper Dangaard Brouer wrote: > > > > [...] > > > > I'll try to respond to both Toke and Jesper here and make it coherent so > > we don't split this thread yet again. > > > > Wish me luck. > > > > @Toke "react" -> "not break" hopefully gives you my opinion on this. > > > > @Toke "five fields gives 32 different metadata formats" OK let me take > > five fields, > > > > struct meta { > > __u32 f1; > > __u32 f2; > > __u32 f3; > > __u32 f4; > > __u32 f5; > > } > > > > I'm still confused why the meta data would change just because the feature > > is enabled or disabled. I've written drivers before and I don't want to > > move around where I write f1 based on some combination of features f2,f3,f4,f5 > > state of enabled/disabled. If features are mutual exclusive I can build a > > sensible union. If its possible for all fields to enabled then I just lay > > them out like above. > > The assumption that the layout would be changing as the features were > enabled came from a discussion I had with Jesper where he pointed out > that zeroing out the fields that were not active comes with a measurable > performance impact. So changing the struct layout to only include the > fields that are currently used is a way to make sure we don't hurt > performance. > > If I'm understanding you correctly, what you propose instead is that we > just keep the struct layout the same and only write the data that we > have, leaving the other fields as uninitialised (so essentially > garbage), right? Correct. > > If we do this, the BPF program obviously needs to know which fields are > valid and which are not. AFAICT you're proposing that this should be > done out-of-band (i.e., by the system administrator manually ensuring > BPF program config fits system config)? I think there are a couple of > problems with this: > > - It requires the system admin to coordinate device config with all of > their installed XDP applications. This is error-prone, especially as > the number of applications grows (say if different containers have > different XDP programs installed on their virtual devices). A complete "system" will need to be choerent. If I forward into a veth device the orchestration component needs to ensure program sending bits there is using the same format the program installed there expects. If I tailcall/fentry into another program that program the callee and caller need to agree on the metadata protocol. I don't see any way around this. Someone has to manage the network. > > - It has synchronisation issues. Say I have an XDP program with optional > support for hardware timestamps and a software fallback. It gets > installed in software fallback mode; then the admin has to make sure > to enable the hardware timestamps before switching the application > into the mode where it will read that metadata field (and the opposite > order when disabling the hardware mode). If you want tell the hardware to use an enable bit then check it on ingress to the XDP program. What I like about this scheme is as a core kernel or networking person I don't have to care how you do this. Figure it out with your hardware and driver. > > Also, we need to be able to deal with different metadata layouts on > different packets in the same program. Consider the XDP program running > on a veth device inside a container above: if this gets packets > redirected into it from different NICs with different layouts, it needs > to be able to figure out which packet came from where. Again I don't think this is the kernel problem. If you have some setup where NICs have different metadata then you need a XDP shim to rewrite metadata. But, as the person who setup this system maybe you should avoid this for the fast path at least. > > With this in mind I think we have to encode the metadata format into the > packet metadata itself somehow. This could just be a matter of including > the BTF ID as part of the struct itself, so that your example could > essentially do: > > if (data->meta_btf_id == timestamp_id) { > struct timestamp_meta *meta = data->meta_data; > // do stuff > } else { > struct normal_meta *meta = data->meta_data; > } I'm not sure why you call it meta_btf_id there? Why not just enabledTstampBit. so if (mymeta->enabledTstampBit) { ...} else { //fallback } > > > and then, to avoid drivers having to define different layouts we could > essentially have the two metadata structs be: > > struct normal_meta { > u32 rxhash; > u32 padding; > u8 vlan; > }; > > and > > struct timestamp_meta { > u32 rxhash; > u32 timestamp; > u8 vlan; > }; So a union on that u32 with padding and timestamp would suffice but sure if you want to do it at compile time with btf_id ok. I think runtime would make more sense. Like this, struct meta { u32 rxhash; u32 timestamp; u8 vlan; u8 flags; } if (m->flags) { read timestamp} > > This still gets us exponential growth in the number of metadata structs, > but at least we could create tooling to auto-generate the BTF for the > variants so the complexity is reduced to just consuming a lot of BTF > IDs. auto-generate the BTF? I'm not sure why its needed to be honest. I think for most simple things you can build a single metadata struct that will fit. For more complex pipeline updates then we should generate the BTF with the pipeline using P4 or such IMO. > > Alternatively we could have an explicit bitmap of valid fields, like: > > struct all_meta { > u32 _valid_field_bitmap; > u32 rxhash; > u32 timestamp; > u8 vlan; > }; Aha I like this as you can tell from above. Would have just jumped here but was responding as I was reading. So I think I'll leave above maybe it is illustrative. > > and if a program reads all_meta->timestamp CO-RE could transparently > insert a check of the relevant field in the bitmap first. My immediate > feeling is that this last solution would be nicer: We'd still need to +1 I think we agree. > include the packet BTF ID in the packet data to deal with case where > packets are coming from different interfaces, but we'd avoid having lots > of different variants with separate BTF IDs. I'm not sure what it would > take to teach CO-RE to support such a scheme, though... > > WDYT? I think we agree to the last option (bitmap field) to start with build that out I think you cover most hardware that is out today. Then the more complex cases come as step2. And I think its hard to understate that the last step above mostly works today, you don't need to go do a bunch of kernel work. > > -Toke >