Andrii Nakryiko wrote: > On Wed, May 26, 2021 at 3:59 AM Jesper Dangaard Brouer > <brouer@xxxxxxxxxx> wrote: > > > > Hi All, > > > > I see a need for a driver to use different XDP metadata layout on a per > > packet basis. E.g. PTP packets contains a hardware timestamp. E.g. VLAN > > offloading and associated metadata as only relevant for packets using > > VLANs. (Reserving room for every possible HW-hint is against the idea > > of BTF). > > > > The question is how to support multiple BTF types on per packet basis? > > (I need input from BTF experts, to tell me if I'm going in the wrong > > direction with below ideas). > > I'm trying to follow all three threads, but still, can someone dumb it Thanks ;) > down for me and use few very specific examples to show how all this is > supposed to work end-to-end. I.e., how the C definition for those > custom BTF layouts might look like and how they are used in BPF > programs, etc. I'm struggling to put all the pieces together, even > ignoring all the netdev-specific configuration questions. Best to start with the simplest possible usable thing and get more complex over time. For a C definition I would expect drivers to do something like this, struct mynic_rx_descriptor { __u64 len; __u64 head; __u64 tail; __u64 foobar; } struct mynic_metadata { __u64 timestamp; __u64 hash; __u64 pkt_type; struct mynic_rx_descriptor *ptr_to_rx; /* other things */ } It doesn't really matter how the driver folks generate their metadata though. They might use some non-C thing that is more natural for writing parser/action/tcam codes. Anyways given some C block like above we generate BTF from above using normal method, quick hack just `pahole -J` the thing. Now we have a BTF file. Next up write some XDP program to do something with it, void myxdp_prog(struct xdp_md *ctx) { struct mynic_metadata m = (struct mynic_metadata *)ctx->data_meta; // now I can get data using normal CO-RE // I usually have this _(&) to put CO-RE attributes in I // believe that is standard? Or use the other macros __u64 pkt_type = _(&m->pkt_type) // we can even walk into structs if we have probe read // around. struct mynic_rx_descriptor *rxdesc = _(&m->ptr_to_rx) // now do whatever I like with above metadata } Run above program through normal CO-RE pass and as long as it has access to the BTF from above it will work. I have some logic sitting around to stitch two BTF blocks together but we have that now done properly for linking. probe_read from XDP should be added regardless of above. I've found it super handy in skmsg programs to dig out kernel info inline. With probe_read we can also start to walk net_device struct for more detailed info as needed. Or into sock structs for process level conntrack (other thread). Even without probe_read above would be useful but fields would need to fit into the metadata where we know we can read/write data. Having drivers export their BTF over a /sys/fs/ interface so that BTF can change with fimware/parser updates is possible as well, but I would want to see above working in real world before committing to a /sys/fs interface. Anyways the interface is just a convienence. > > As for BTF on a per-packet basis. This means that BTF itself is not > known at the BPF program verification time, so there will be some sort > of if/else if/else conditions to handle all recognized BTF IDs? Is > that right? Fake but specific code would help (at least me) to > actually join the discussion. Thanks. I don't think we actually want per-packet data that sounds a bit clumsy for me. Lets use a union and define it so that we have a single BTF. struct mynic_metadata { __u64 pkt_type union { struct ipv6_meta meta; struct ipv4_meta meta; struct arp_meta meta; } }; Then program has to swivel on pkt_type but that is most natural C thing to do IMO. Honestly we have about 90% of the necessary bits to do this now. Typed that up a bit fast hope its legible. Got a lot going on today. Andrii, make sense? Thanks, John > > > > > Let me describe a possible/proposed packet flow (feel free to disagree): > > > > When driver RX e.g. a PTP packet it knows HW is configured for PTP-TS and > > when it sees a TS is available, then it chooses a code path that use the > > BTF layout that contains RX-TS. To communicate what BTF-type the > > XDP-metadata contains, it simply store the BTF-ID in xdp_buff->btf_id. > > > > When redirecting the xdp_buff is converted to xdp_frame, and also contains > > the btf_id member. When converting xdp_frame to SKB, then netcore-code > > checks if this BTF-ID have been registered, if so there is a (callback or > > BPF-hook) registered to handle this BTF-type that transfer the fields from > > XDP-metadata area into SKB fields. > > > > The XDP-prog also have access to this ctx->btf_id and can multiplex on > > this in the BPF-code itself. Or use other methods like parsing PTP packet > > and extract TS as expected BTF offset in XDP metadata (perhaps add a > > sanity check if metadata-size match). > > > > > > I talked to AF_XDP people (Magnus, Bjørn and William) about this idea, > > and they pointed out that AF_XDP also need to know what BTF-layout is > > used. As Magnus wrote in other thread; there is only 32-bit left in > > AF_XDP descriptor option. We could store the BTF-ID in this field, but > > it would block for other use-cases. Bjørn came up with the idea of > > storing the BTF-ID in the BTF-layout itself, but as the last-member (to > > have fixed offset to check in userspace AF_XDP program). Then we only > > need to use a single bit in AF_XDP descriptor option to say > > XDP-metadata is BTF described. > > > > In the AF_XDP userspace program, the programmers can have a similar > > callback system per known BTF-ID. This way they can compile efficient > > code per ID via requesting the BTF layout from the kernel. (Hint: > > `bpftool btf dump id 42 format c`). > > > > Please let me know if this it the right or wrong direction? > > > > -- > > Best regards, > > Jesper Dangaard Brouer > > MSc.CS, Principal Kernel Engineer at Red Hat > > LinkedIn: http://www.linkedin.com/in/brouer > >