Re: XDP-hints: Howto support multiple BTF types per packet basis?

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Wed, 26 May 2021 15:54:59 -0700

On Wed, May 26, 2021 at 1:31 PM John Fastabend <john.fastabend@xxxxxxxxx> wrote:
>
> Andrii Nakryiko wrote:
> > On Wed, May 26, 2021 at 3:59 AM Jesper Dangaard Brouer
> > <brouer@xxxxxxxxxx> wrote:
> > >
> > > Hi All,
> > >
> > > I see a need for a driver to use different XDP metadata layout on a per
> > > packet basis. E.g. PTP packets contains a hardware timestamp. E.g. VLAN
> > > offloading and associated metadata as only relevant for packets using
> > > VLANs. (Reserving room for every possible HW-hint is against the idea
> > > of BTF).
> > >
> > > The question is how to support multiple BTF types on per packet basis?
> > > (I need input from BTF experts, to tell me if I'm going in the wrong
> > > direction with below ideas).
> >
> > I'm trying to follow all three threads, but still, can someone dumb it
>
> Thanks ;)
>
> > down for me and use few very specific examples to show how all this is
> > supposed to work end-to-end. I.e., how the C definition for those
> > custom BTF layouts might look like and how they are used in BPF
> > programs, etc. I'm struggling to put all the pieces together, even
> > ignoring all the netdev-specific configuration questions.
>
> Best to start with the simplest possible usable thing and get more
> complex over time.
>
> For a C definition I would expect drivers to do something like this,
>
>  struct mynic_rx_descriptor {
>         __u64 len;
>         __u64 head;
>         __u64 tail;
>         __u64 foobar;
>  }
>
>  struct mynic_metadata {
>         __u64 timestamp;
>         __u64 hash;
>         __u64 pkt_type;
>         struct mynic_rx_descriptor *ptr_to_rx;
>         /* other things */
>  }
>
> It doesn't really matter how the driver folks generate their metadata
> though. They might use some non-C thing that is more natural for
> writing parser/action/tcam codes.
>
> Anyways given some C block like above we generate BTF from above
> using normal method, quick hack just `pahole -J` the thing. Now we
> have a BTF file.
>
> Next up write some XDP program to do something with it,
>
>  void myxdp_prog(struct xdp_md *ctx) {
>         struct mynic_metadata m = (struct mynic_metadata *)ctx->data_meta;
>
>         // now I can get data using normal CO-RE
>         // I usually have this _(&) to put CO-RE attributes in I
>         // believe that is standard? Or use the other macros
>         __u64 pkt_type = _(&m->pkt_type)

add __attribute__((preserve_access_index)) to the struct
mynic_metadata above (when compiling your BPF program) and you don't
need _() ugliness:

__u64 pkt_type = m->pkt_type; /* it's CO-RE relocatable already */

we have preserve_access_index as a code block (some selftests do this)
for cases when you can't annotate types

>
>         // we can even walk into structs if we have probe read
>         // around.
>         struct mynic_rx_descriptor *rxdesc = _(&m->ptr_to_rx)
>
>         // now do whatever I like with above metadata
>  }
>
> Run above program through normal CO-RE pass and as long as it has
> access to the BTF from above it will work. I have some logic
> sitting around to stitch two BTF blocks together but we have
> that now done properly for linking.

"stitching BTF blocks together" sort of jumped out of nowhere, what is
this needed for? And not sure what "BTF block" means exactly, it's a
new terminology.

>
> probe_read from XDP should be added regardless of above. I've
> found it super handy in skmsg programs to dig out kernel info
> inline. With probe_read we can also start to walk net_device
> struct for more detailed info as needed. Or into sock structs

yes, libbpf provides BPF_CORE_READ() macro that allows to walk across
struct referenced by pointers, e.g.,:

int my_data = BPF_CORE_READ(m, ptr_to_rx, rx_field);

is logical equivalent of

int my_data = m->ptr_to_rx->rx_field;

> for process level conntrack (other thread). Even without
> probe_read above would be useful but fields would need to fit
> into the metadata where we know we can read/write data.
>
> Having drivers export their BTF over a /sys/fs/ interface
> so that BTF can change with fimware/parser updates is possible
> as well, but I would want to see above working in real world
> before committing to a /sys/fs interface. Anyways the
> interface is just a convienence.

it's important enough to discuss because libbpf has to get it somehow
(or be directly provided as an extra option or something).

>
> >
> > As for BTF on a per-packet basis. This means that BTF itself is not
> > known at the BPF program verification time, so there will be some sort
> > of if/else if/else conditions to handle all recognized BTF IDs? Is
> > that right? Fake but specific code would help (at least me) to
> > actually join the discussion. Thanks.
>
> I don't think we actually want per-packet data that sounds a bit
> clumsy for me. Lets use a union and define it so that we have a
> single BTF.

union and independent set of BTFs are two different things, I'll let
you guys figure out which one you need, but I replied how it could
look like in CO-RE world

>
>  struct mynic_metadata {
>   __u64 pkt_type
>   union {
>       struct ipv6_meta meta;
>       struct ipv4_meta meta;
>       struct arp_meta meta;

obviously fields can't be named the same, so you'll have meta_ipv6,
meta_ipv4, meta_arp fields, but I get the idea. This works if BTF
layout is set in stone. What Jesper proposes would allow to adds new
BTF layouts at runtime and still be able to handle that (as in detect
and ignore) with already running BPF programs.

CO-RE is sufficiently sophisticated to handle both today, so I don't care :)

>   }
>  };
>
> Then program has to swivel on pkt_type but that is most natural
> C thing to do IMO.
>
> Honestly we have about 90% of the necessary bits to do this now.
> Typed that up a bit fast hope its legible. Got a lot going on today.
>
> Andrii, make sense?

Yes, thanks! The logistics of getting that BTF to libbpf is the most
fuzzy area and not worked out completely. The low-level details of
relocations are already in place if libbpf can be pointed to the right
set of BTF types.

BTW, not that I encourage such abuse, but for the experiment's sake,
you can (ab)use module BTFs mechanism today to allow dynamically
adding/removing split BTFs built on top of kernel (vmlinux) BTF. I
suggest looking into how module BTFs are handled both inside the
kernel and in libbpf.

>
> Thanks,
> John
> >
> > >
> > > Let me describe a possible/proposed packet flow (feel free to disagree):
> > >
> > >  When driver RX e.g. a PTP packet it knows HW is configured for PTP-TS and
> > >  when it sees a TS is available, then it chooses a code path that use the
> > >  BTF layout that contains RX-TS. To communicate what BTF-type the
> > >  XDP-metadata contains, it simply store the BTF-ID in xdp_buff->btf_id.
> > >
> > >  When redirecting the xdp_buff is converted to xdp_frame, and also contains
> > >  the btf_id member. When converting xdp_frame to SKB, then netcore-code
> > >  checks if this BTF-ID have been registered, if so there is a (callback or
> > >  BPF-hook) registered to handle this BTF-type that transfer the fields from
> > >  XDP-metadata area into SKB fields.
> > >
> > >  The XDP-prog also have access to this ctx->btf_id and can multiplex on
> > >  this in the BPF-code itself. Or use other methods like parsing PTP packet
> > >  and extract TS as expected BTF offset in XDP metadata (perhaps add a
> > >  sanity check if metadata-size match).
> > >
> > >
> > > I talked to AF_XDP people (Magnus, Bjørn and William) about this idea,
> > > and they pointed out that AF_XDP also need to know what BTF-layout is
> > > used. As Magnus wrote in other thread; there is only 32-bit left in
> > > AF_XDP descriptor option. We could store the BTF-ID in this field, but
> > > it would block for other use-cases. Bjørn came up with the idea of
> > > storing the BTF-ID in the BTF-layout itself, but as the last-member (to
> > > have fixed offset to check in userspace AF_XDP program). Then we only
> > > need to use a single bit in AF_XDP descriptor option to say
> > > XDP-metadata is BTF described.
> > >
> > > In the AF_XDP userspace program, the programmers can have a similar
> > > callback system per known BTF-ID. This way they can compile efficient
> > > code per ID via requesting the BTF layout from the kernel. (Hint:
> > > `bpftool btf dump id 42 format c`).
> > >
> > > Please let me know if this it the right or wrong direction?
> > >
> > > --
> > > Best regards,
> > >   Jesper Dangaard Brouer
> > >   MSc.CS, Principal Kernel Engineer at Red Hat
> > >   LinkedIn: http://www.linkedin.com/in/brouer
> > >
>
>