Martin KaFai Lau <martin.lau@xxxxxxxxx> writes: > On 11/10/22 6:19 AM, Toke Høiland-Jørgensen wrote: >> Martin KaFai Lau <martin.lau@xxxxxxxxx> writes: >> >>> On 11/9/22 3:10 AM, Toke Høiland-Jørgensen wrote: >>>> Snipping a bit of context to reply to this bit: >>>> >>>>>>>> Can the xdp prog still change the metadata through xdp->data_meta? tbh, I am not >>>>>>>> sure it is solid enough by asking the xdp prog not to use the same random number >>>>>>>> in its own metadata + not to change the metadata through xdp->data_meta after >>>>>>>> calling bpf_xdp_metadata_export_to_skb(). >>>>>>> >>>>>>> What do you think the usecase here might be? Or are you suggesting we >>>>>>> reject further access to data_meta after >>>>>>> bpf_xdp_metadata_export_to_skb somehow? >>>>>>> >>>>>>> If we want to let the programs override some of this >>>>>>> bpf_xdp_metadata_export_to_skb() metadata, it feels like we can add >>>>>>> more kfuncs instead of exposing the layout? >>>>>>> >>>>>>> bpf_xdp_metadata_export_to_skb(ctx); >>>>>>> bpf_xdp_metadata_export_skb_hash(ctx, 1234); >>>> >>>> There are several use cases for needing to access the metadata after >>>> calling bpf_xdp_metdata_export_to_skb(): >>>> >>>> - Accessing the metadata after redirect (in a cpumap or devmap program, >>>> or on a veth device) >>>> - Transferring the packet+metadata to AF_XDP >>> fwiw, the xdp prog could also be more selective and only stores one of the hints >>> instead of the whole 'struct xdp_to_skb_metadata'. >> >> Yup, absolutely! In that sense, reusing the SKB format is mostly a >> convenience feature. However, lots of people consume AF_XDP through the >> default program installed by libxdp in the XSK setup code, and including >> custom metadata there is awkward. So having the metadata consumed by the >> stack as the "default subset" would enable easy consumption by >> non-advanced users, while advanced users can still do custom stuff by >> writing their own XDP program that calls the kfuncs. >> >>>> - Returning XDP_PASS, but accessing some of the metadata first (whether >>>> to read or change it) >>>> >>>> The last one could be solved by calling additional kfuncs, but that >>>> would be less efficient than just directly editing the struct which >>>> will be cache-hot after the helper returns. >>> >>> Yeah, it is more efficient to directly write if possible. I think this set >>> allows the direct reading and writing already through data_meta (as a _u8 *). >> >> Yup, totally fine with just keeping that capability :) >> >>>> And yeah, this will allow the XDP program to inject arbitrary metadata >>>> into the netstack; but it can already inject arbitrary *packet* data >>>> into the stack, so not sure if this is much of an additional risk? If it >>>> does lead to trivial crashes, we should probably harden the stack >>>> against that? >>>> >>>> As for the random number, Jesper and I discussed replacing this with the >>>> same BTF-ID scheme that he was using in his patch series. I.e., instead >>>> of just putting in a random number, we insert the BTF ID of the metadata >>>> struct at the end of it. This will allow us to support multiple >>>> different formats in the future (not just changing the layout, but >>>> having multiple simultaneous formats in the same kernel image), in case >>>> we run out of space. >>> >>> This seems a bit hypothetical. How much headroom does it usually have for the >>> xdp prog? Potentially the hints can use all the remaining space left after the >>> header encap and the current bpf_xdp_adjust_meta() usage? >> >> For the metadata consumed by the stack right now it's a bit >> hypothetical, yeah. However, there's a bunch of metadata commonly >> supported by hardware that the stack currently doesn't consume and that >> hopefully this feature will end up making more accessible. My hope is >> that the stack can also learn how to use this in the future, in which >> case we may run out of space. So I think of that bit mostly as >> future-proofing... > > ic. in this case, Can the btf_id be added to 'struct xdp_to_skb_metadata' later > if it is indeed needed? The 'struct xdp_to_skb_metadata' is not in UAPI and > doing it with CO-RE is to give us flexibility to make this kind of changes in > the future. My worry is mostly that it'll be more painful to add it later than just including it from the start, mostly because of AF_XDP users. But if we do the randomisation thing (thus forcing AF_XDP users to deal with the dynamic layout as well), it should be possible to add it later, and I can live with that option as well... >>>> We should probably also have a flag set on the xdp_frame so the stack >>>> knows that the metadata area contains relevant-to-skb data, to guard >>>> against an XDP program accidentally hitting the "magic number" (BTF_ID) >>>> in unrelated stuff it puts into the metadata area. >>> >>> Yeah, I think having a flag is useful. The flag will be set at xdp_buff and >>> then transfer to the xdp_frame? >> >> Yeah, exactly! >> >>>>> After re-reading patch 6, have another question. The 'void >>>>> bpf_xdp_metadata_export_to_skb();' function signature. Should it at >>>>> least return ok/err? or even return a 'struct xdp_to_skb_metadata *' >>>>> pointer and the xdp prog can directly read (or even write) it? >>>> >>>> Hmm, I'm not sure returning a failure makes sense? Failure to read one >>>> or more fields just means that those fields will not be populated? We >>>> should probably have a flags field inside the metadata struct itself to >>>> indicate which fields are set or not, but I'm not sure returning an >>>> error value adds anything? Returning a pointer to the metadata field >>>> might be convenient for users (it would just be an alias to the >>>> data_meta pointer, but the verifier could know its size, so the program >>>> doesn't have to bounds check it). >>> >>> If some hints are not available, those hints should be initialized to >>> 0/CHECKSUM_NONE/...etc. >> >> The problem with that is that then we have to spend cycles writing >> eight bytes of zeroes into the checksum field :) > > With a common 'struct xdp_to_skb_metadata', I am not sure how some of these zero > writes can be avoided. If the xdp prog wants to optimize, it can call > individual kfunc to get individual hints. Erm, we just... don't write those fields? Something like: void write_skb_meta(hw, ctx) { struct xdp_skb_metadata meta = ctx->data_meta - sizeof(struct xdp_skb_metadata); meta->valid_fields = 0; if (hw_has_timestamp) { meta->timestamp = hw->timestamp; meta->valid_fields |= FIELD_TIMESTAMP; } /* otherwise meta->timestamp is just uninitialised */ if (hw_has_rxhash) { meta->rxhash = hw->rxhash; meta->valid_fields |= FIELD_RXHASH; } /* otherwise meta->rxhash is just uninitialised */ ...etc... } >>> The xdp prog needs a direct way to tell hard failure when it cannot >>> write the meta area because of not enough space. Comparing >>> xdp->data_meta with xdp->data as a side effect is not intuitive. >> >> Yeah, hence a flags field so we can just see if setting each field >> succeeded? > > How testing a flag is different from checking 0/invalid-value of a > field? The flags field is smaller, so we write fewer bytes if not all fields are populated. > or some fields just don't have an invalid value to check for > like vlan_tci? Yeah, that could also be an issue. > You meant a flags field as a return value or in the 'struct xdp_to_skb_metadata' ? See above. >> >>> It is more than saving the bound check. With type info of 'struct >>> xdp_to_skb_metadata *', the verifier can do more checks like reading in the >>> middle of an integer member. The verifier could also limit write access only to >>> a few struct's members if it is needed. >>> >>> The returning 'struct xdp_to_skb_metadata *' should not be an alias to the >>> xdp->data_meta. They should actually point to different locations in the >>> headroom. bpf_xdp_metadata_export_to_skb() sets a flag in xdp_buff. >>> xdp->data_meta won't be changed and keeps pointing to the last >>> bpf_xdp_adjust_meta() location. The kernel will know if there is >>> xdp_to_skb_metadata before the xdp->data_meta when that bit is set in the >>> xdp_{buff,frame}. Would it work? >> >> Hmm, logically splitting the program metadata and the xdp_hints metadata >> (but having them share the same area) *could* work, I guess, I'm just >> not sure it's worth the extra complexity? > > It shouldn't stop the existing xdp prog writing its own metadata from using the > the new bpf_xdp_metadata_export_to_skb(). Right, I think I see what you mean, and I agree that splitting it internally in the kernel does make sense (see my other reply to Stanislav). >> >>>>> A related question, why 'struct xdp_to_skb_metadata' needs >>>>> __randomize_layout? >>>> >>>> The __randomize_layout thing is there to force BPF programs to use CO-RE >>>> to access the field. This is to avoid the struct layout accidentally >>>> ossifying because people in practice rely on a particular layout, even >>>> though we tell them to use CO-RE. There are lots of examples of this >>>> happening in other domains (IP header options, TCP options, etc), and >>>> __randomize_layout seemed like a neat trick to enforce CO-RE usage :) >>> >>> I am not sure if it is necessary or helpful to only enforce __randomize_layout >>> in 'struct xdp_to_skb_metadata'. There are other CO-RE use cases (tracing and >>> non tracing) that already have direct access (reading and/or writing) to other >>> kernel structures. >>> >>> It is more important for the verifier to see the xdp prog accessing it as a >>> 'struct xdp_to_skb_metadata *' instead of xdp->data_meta which is a __u8 * so >>> that the verifier can enforce the rules of access. >> >> That only works inside the kernel, though. Since the metadata field can >> be copied wholesale to AF_XDP, having it randomized forces userspace >> consumers to also write code to deal with the layout being dynamic... > > hm... I still don't see how useful it is, in particular you mentioned > the libxdp will install a xdp prog to write this default format > (xdp_to_skb_metadata) and likely libxdp will also provide some helpers > to parse the xdp_to_skb_metadata and the libxdp user should not need > to know if CO-RE is used or not. Considering it is a kernel internal > struct, I think it is fine to keep it and can be revisited later if > needed. Lets get on to other things first :) Well, if it was just kernel-internal, sure. But we're also exporting it to userspace (through AF_XDP). Well-behaved users will obviously do the right thing and use CO-RE. I'm trying to guard against users just blindly copy-pasting the struct definition from the kernel and doing a static cast, then complaining about their code breaking the first time we change the struct layout. Which I sadly expect that there absolutely will be people who do unless we actively make sure that doesn't work from the get-go. And since the randomisation is literally just adding __randomize_layout to the struct definition, I'd rather start out with having it, and removing it later if it turns out not to be needed... :) -Toke