Re: XDP-hints: Howto support multiple BTF types per packet basis?

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Thu, 27 May 2021 10:44:00 -0700

On Wed, May 26, 2021 at 5:44 PM John Fastabend <john.fastabend@xxxxxxxxx> wrote:
>
> [...]
>
> > > Best to start with the simplest possible usable thing and get more
> > > complex over time.
> > >
> > > For a C definition I would expect drivers to do something like this,
> > >
> > >  struct mynic_rx_descriptor {
> > >         __u64 len;
> > >         __u64 head;
> > >         __u64 tail;
> > >         __u64 foobar;
> > >  }
> > >
> > >  struct mynic_metadata {
> > >         __u64 timestamp;
> > >         __u64 hash;
> > >         __u64 pkt_type;
> > >         struct mynic_rx_descriptor *ptr_to_rx;
> > >         /* other things */
> > >  }
> > >
> > > It doesn't really matter how the driver folks generate their metadata
> > > though. They might use some non-C thing that is more natural for
> > > writing parser/action/tcam codes.
> > >
> > > Anyways given some C block like above we generate BTF from above
> > > using normal method, quick hack just `pahole -J` the thing. Now we
> > > have a BTF file.
> > >
> > > Next up write some XDP program to do something with it,
> > >
> > >  void myxdp_prog(struct xdp_md *ctx) {
> > >         struct mynic_metadata m = (struct mynic_metadata *)ctx->data_meta;
> > >
> > >         // now I can get data using normal CO-RE
> > >         // I usually have this _(&) to put CO-RE attributes in I
> > >         // believe that is standard? Or use the other macros
> > >         __u64 pkt_type = _(&m->pkt_type)
> >
> > add __attribute__((preserve_access_index)) to the struct
> > mynic_metadata above (when compiling your BPF program) and you don't
> > need _() ugliness:
>
> +1. Although sometimes I like the ugliness so I can keep track
> of whats in CO-RE and not.

Oh, I'm just against using underscore as an identifier, I'd use
something a bit more explicit.

>
> >
> > __u64 pkt_type = m->pkt_type; /* it's CO-RE relocatable already */
> >
> > we have preserve_access_index as a code block (some selftests do this)
> > for cases when you can't annotate types
> >
> > >
> > >         // we can even walk into structs if we have probe read
> > >         // around.
> > >         struct mynic_rx_descriptor *rxdesc = _(&m->ptr_to_rx)
> > >
> > >         // now do whatever I like with above metadata
> > >  }
> > >
> > > Run above program through normal CO-RE pass and as long as it has
> > > access to the BTF from above it will work. I have some logic
> > > sitting around to stitch two BTF blocks together but we have
> > > that now done properly for linking.
> >
> > "stitching BTF blocks together" sort of jumped out of nowhere, what is
> > this needed for? And not sure what "BTF block" means exactly, it's a
> > new terminology.
>
> I didn't know what the correct terminology here would be.

I just wasn't sure if "BTF block" is a single BTF type or it's a
collection of types built on top of vmlinux BTF (what we call split
BTF). Seems like it's the latter.

>
> What I meant is I think what you have here,
>
> "
>  BTW, not that I encourage such abuse, but for the experiment's sake,
>  you can (ab)use module BTFs mechanism today to allow dynamically
>  adding/removing split BTFs built on top of kernel (vmlinux) BTF
> "
>
> So if vendor/driver writer has a BTF file for whatever the current
> hardware is doing we can use the split BTF build mechanism to
> include it. This can be used to get Jespers dynamic reprogram
> hardware example. We just need someway to get the BTF of the
> current running hardware. What I'm suggesting to get going we
> can just take that out of band, libbpf/kernel don't have
> to care where it comes from as long as libbpf can consume the
> split BTFs before doing CO-RE.
>
> With this model I can have a single XDP program and it will
> run on multiple hardware or the same hardware across updates
> when I can use the normal CO-RE macros to access the metadata.
> When I update my hardware I just need to get ahold of the
> BTF when I do that update and my programs will continue to
> work.
>
> Once we show the value of above we can talk about a driver
> mechanism to expose the BTF over some interface, maybe in
> /sys/fs. But that would still look like a split BTF from libbpf
> side. The advantage is it should work today.

Right, except I don't think we have libbpf APIs to specify this, but
that's solvable.

>
> I called the process of taking two BTF files, vmlinux BTF and
> user provided NIC metadata BTF, and using those for CO-RE
> logic "stitching BTF blocks together".
>
> >
> > >
> > > probe_read from XDP should be added regardless of above. I've
> > > found it super handy in skmsg programs to dig out kernel info
> > > inline. With probe_read we can also start to walk net_device
> > > struct for more detailed info as needed. Or into sock structs
> >
> > yes, libbpf provides BPF_CORE_READ() macro that allows to walk across
> > struct referenced by pointers, e.g.,:
> >
> > int my_data = BPF_CORE_READ(m, ptr_to_rx, rx_field);
> >
> > is logical equivalent of
> >
> > int my_data = m->ptr_to_rx->rx_field;
>
> The only complication here is ptr_to_rx is outside XDP data
> so we need XDP program to support probe_read(). So depending
> on current capabilities a BPF program might be limited to
> just its own data block or with higher caps able to use
> more of the features.
>

Right.

> >
> > > for process level conntrack (other thread). Even without
> > > probe_read above would be useful but fields would need to fit
> > > into the metadata where we know we can read/write data.
> > >
> > > Having drivers export their BTF over a /sys/fs/ interface
> > > so that BTF can change with fimware/parser updates is possible
> > > as well, but I would want to see above working in real world
> > > before committing to a /sys/fs interface. Anyways the
> > > interface is just a convienence.
> >
> > it's important enough to discuss because libbpf has to get it somehow
> > (or be directly provided as an extra option or something).
>
> I believe to start with directly providing it is the easiest
> approach. Then as a second step we can pull it from a /sys/fs
> interface.
>
> >
> > >
> > > >
> > > > As for BTF on a per-packet basis. This means that BTF itself is not
> > > > known at the BPF program verification time, so there will be some sort
> > > > of if/else if/else conditions to handle all recognized BTF IDs? Is
> > > > that right? Fake but specific code would help (at least me) to
> > > > actually join the discussion. Thanks.
> > >
> > > I don't think we actually want per-packet data that sounds a bit
> > > clumsy for me. Lets use a union and define it so that we have a
> > > single BTF.
> >
> > union and independent set of BTFs are two different things, I'll let
> > you guys figure out which one you need, but I replied how it could
> > look like in CO-RE world
>
> I think a union is sufficient and more aligned with how the
> hardware would actually work.

Sure. And I think those are two orthogonal concerns. You can start
with a single struct mynic_metadata with union inside it, and later
add the ability to swap mynic_metadata with another
mynic_metadata___v2 that will have a similar union but with a
different layout.

>
> >
> > >
> > >  struct mynic_metadata {
> > >   __u64 pkt_type
> > >   union {
> > >       struct ipv6_meta meta;
> > >       struct ipv4_meta meta;
> > >       struct arp_meta meta;
> >
> > obviously fields can't be named the same, so you'll have meta_ipv6,
>
> Sure just typing a quick example.
>
> > meta_ipv4, meta_arp fields, but I get the idea. This works if BTF
> > layout is set in stone. What Jesper proposes would allow to adds new
> > BTF layouts at runtime and still be able to handle that (as in detect
> > and ignore) with already running BPF programs.
>
> Same answer as above. As long as the BTF can be split into two
> files I don't think libbpf should care if its always the same
> NIC.btf + vmlinux.btf or diffent correct?
>
> >
> > CO-RE is sufficiently sophisticated to handle both today, so I don't care :)
>
> +1
>
> >
> > >   }
> > >  };
> > >
> > > Then program has to swivel on pkt_type but that is most natural
> > > C thing to do IMO.
> > >
> > > Honestly we have about 90% of the necessary bits to do this now.
> > > Typed that up a bit fast hope its legible. Got a lot going on today.
> > >
> > > Andrii, make sense?
> >
> > Yes, thanks! The logistics of getting that BTF to libbpf is the most
> > fuzzy area and not worked out completely. The low-level details of
> > relocations are already in place if libbpf can be pointed to the right
> > set of BTF types.
>
> Per above, getting that BTF to libbpf should be a user problem for
> a bit. Once how those programs look is worked out I think drivers
> can push them out via /sys/kernel/btf/mynic
>
> >
> > BTW, not that I encourage such abuse, but for the experiment's sake,
> > you can (ab)use module BTFs mechanism today to allow dynamically
> > adding/removing split BTFs built on top of kernel (vmlinux) BTF. I
> > suggest looking into how module BTFs are handled both inside the
> > kernel and in libbpf.
> >
>
> Exactly the abuse I was thinking ;)