Re: XDP-hints: Howto support multiple BTF types per packet basis?

John Fastabend <john.fastabend@xxxxxxxxx> · Wed, 26 May 2021 13:31:26 -0700

Andrii Nakryiko wrote:
> On Wed, May 26, 2021 at 3:59 AM Jesper Dangaard Brouer
> <brouer@xxxxxxxxxx> wrote:
> >
> > Hi All,
> >
> > I see a need for a driver to use different XDP metadata layout on a per
> > packet basis. E.g. PTP packets contains a hardware timestamp. E.g. VLAN
> > offloading and associated metadata as only relevant for packets using
> > VLANs. (Reserving room for every possible HW-hint is against the idea
> > of BTF).
> >
> > The question is how to support multiple BTF types on per packet basis?
> > (I need input from BTF experts, to tell me if I'm going in the wrong
> > direction with below ideas).
> 
> I'm trying to follow all three threads, but still, can someone dumb it

Thanks ;)

> down for me and use few very specific examples to show how all this is
> supposed to work end-to-end. I.e., how the C definition for those
> custom BTF layouts might look like and how they are used in BPF
> programs, etc. I'm struggling to put all the pieces together, even
> ignoring all the netdev-specific configuration questions.

Best to start with the simplest possible usable thing and get more
complex over time.

For a C definition I would expect drivers to do something like this,

 struct mynic_rx_descriptor {
	__u64 len;
        __u64 head;
	__u64 tail;
	__u64 foobar;
 }

 struct mynic_metadata {
	__u64 timestamp;
	__u64 hash;
	__u64 pkt_type;
	struct mynic_rx_descriptor *ptr_to_rx;
	/* other things */
 }

It doesn't really matter how the driver folks generate their metadata
though. They might use some non-C thing that is more natural for
writing parser/action/tcam codes.

Anyways given some C block like above we generate BTF from above
using normal method, quick hack just `pahole -J` the thing. Now we
have a BTF file.

Next up write some XDP program to do something with it,

 void myxdp_prog(struct xdp_md *ctx) {
	struct mynic_metadata m = (struct mynic_metadata *)ctx->data_meta;	

	// now I can get data using normal CO-RE
	// I usually have this _(&) to put CO-RE attributes in I
        // believe that is standard? Or use the other macros
	__u64 pkt_type = _(&m->pkt_type)

        // we can even walk into structs if we have probe read
        // around.
        struct mynic_rx_descriptor *rxdesc = _(&m->ptr_to_rx)

        // now do whatever I like with above metadata
 }

Run above program through normal CO-RE pass and as long as it has
access to the BTF from above it will work. I have some logic
sitting around to stitch two BTF blocks together but we have
that now done properly for linking.

probe_read from XDP should be added regardless of above. I've
found it super handy in skmsg programs to dig out kernel info
inline. With probe_read we can also start to walk net_device
struct for more detailed info as needed. Or into sock structs
for process level conntrack (other thread). Even without
probe_read above would be useful but fields would need to fit
into the metadata where we know we can read/write data.

Having drivers export their BTF over a /sys/fs/ interface
so that BTF can change with fimware/parser updates is possible
as well, but I would want to see above working in real world
before committing to a /sys/fs interface. Anyways the
interface is just a convienence.

> 
> As for BTF on a per-packet basis. This means that BTF itself is not
> known at the BPF program verification time, so there will be some sort
> of if/else if/else conditions to handle all recognized BTF IDs? Is
> that right? Fake but specific code would help (at least me) to
> actually join the discussion. Thanks.

I don't think we actually want per-packet data that sounds a bit
clumsy for me. Lets use a union and define it so that we have a
single BTF.

 struct mynic_metadata {
  __u64 pkt_type
  union {
      struct ipv6_meta meta; 
      struct ipv4_meta meta;
      struct arp_meta meta;
  }
 };

Then program has to swivel on pkt_type but that is most natural
C thing to do IMO.

Honestly we have about 90% of the necessary bits to do this now.
Typed that up a bit fast hope its legible. Got a lot going on today.

Andrii, make sense?

Thanks,
John
> 
> >
> > Let me describe a possible/proposed packet flow (feel free to disagree):
> >
> >  When driver RX e.g. a PTP packet it knows HW is configured for PTP-TS and
> >  when it sees a TS is available, then it chooses a code path that use the
> >  BTF layout that contains RX-TS. To communicate what BTF-type the
> >  XDP-metadata contains, it simply store the BTF-ID in xdp_buff->btf_id.
> >
> >  When redirecting the xdp_buff is converted to xdp_frame, and also contains
> >  the btf_id member. When converting xdp_frame to SKB, then netcore-code
> >  checks if this BTF-ID have been registered, if so there is a (callback or
> >  BPF-hook) registered to handle this BTF-type that transfer the fields from
> >  XDP-metadata area into SKB fields.
> >
> >  The XDP-prog also have access to this ctx->btf_id and can multiplex on
> >  this in the BPF-code itself. Or use other methods like parsing PTP packet
> >  and extract TS as expected BTF offset in XDP metadata (perhaps add a
> >  sanity check if metadata-size match).
> >
> >
> > I talked to AF_XDP people (Magnus, Bjørn and William) about this idea,
> > and they pointed out that AF_XDP also need to know what BTF-layout is
> > used. As Magnus wrote in other thread; there is only 32-bit left in
> > AF_XDP descriptor option. We could store the BTF-ID in this field, but
> > it would block for other use-cases. Bjørn came up with the idea of
> > storing the BTF-ID in the BTF-layout itself, but as the last-member (to
> > have fixed offset to check in userspace AF_XDP program). Then we only
> > need to use a single bit in AF_XDP descriptor option to say
> > XDP-metadata is BTF described.
> >
> > In the AF_XDP userspace program, the programmers can have a similar
> > callback system per known BTF-ID. This way they can compile efficient
> > code per ID via requesting the BTF layout from the kernel. (Hint:
> > `bpftool btf dump id 42 format c`).
> >
> > Please let me know if this it the right or wrong direction?
> >
> > --
> > Best regards,
> >   Jesper Dangaard Brouer
> >   MSc.CS, Principal Kernel Engineer at Red Hat
> >   LinkedIn: http://www.linkedin.com/in/brouer
> >