Re: XDP-hints: Howto support multiple BTF types per packet basis?

Jesper Dangaard Brouer <brouer@xxxxxxxxxx> · Mon, 31 May 2021 15:17:48 +0200

On Mon, 31 May 2021 13:03:14 +0200
Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:

> John Fastabend <john.fastabend@xxxxxxxxx> writes:
> 
> > Jesper Dangaard Brouer wrote:
> >
> > [...]
> >
> > I'll try to respond to both Toke and Jesper here and make it coherent so
> > we don't split this thread yet again.
> >
> > Wish me luck.
> >
> > @Toke "react" -> "not break" hopefully gives you my opinion on this.
> >
> > @Toke "five fields gives 32 different metadata formats" OK let me take
> > five fields,
> >
> >  struct meta {
> >    __u32 f1;
> >    __u32 f2;
> >    __u32 f3;
> >    __u32 f4;
> >    __u32 f5;
> >  }
> >
> > I'm still confused why the meta data would change just because the feature
> > is enabled or disabled. I've written drivers before and I don't want to
> > move around where I write f1 based on some combination of features f2,f3,f4,f5

Just be be clear:  I'm not suggesting drivers need to dynamically write
at different offsets.  Adding this BTF-flexibility make is possible,
but in practice the driver need to be smart about this.  I think we all
understand this would kill performance and create too complex drivers.

Drivers and hardware have a natural need to place info at fixed offsets.
The hole exercise is to create a layer between drivers and BPF+netstack
via BTF that can express flexibility.  That BTF can express this
flexibility doesn't mean that the drivers/hardware all of a sudden need
to be as flexible.

Drivers will continue to place info at fixed offsets (likely make sense
to have a big struct with unions and everything). I'm suggesting that
the driver will tell the "flex-layer" via BTF which-members-are-what by
setting a BTF-ID that "describe" this.
(Hint, drivers have knowledge like what HW features bits are enabled
that can be translated into a given BTF-ID.  Potentially drivers can
update this BTF-ID when setup events via ethtool occurs)

To avoid the combinations explode, the driver can choose to limit these
via e.g. always include the vlan_id but set it to zero even when
vlan-offloads are turned off.  Drivers can also *choose* to have a
single and very advanced BTF-layout with bitfields and everything (as
the BTF-side is superflexible and support this).  Again the drivers
should be moderate and not implement a combination explosion of BTF-IDs
just because BTF allowed this high level of flexibility.

> > state of enabled/disabled. If features are mutual exclusive I can build a
> > sensible union. If its possible for all fields to enabled then I just lay
> > them out like above.  
> 
> The assumption that the layout would be changing as the features were
> enabled came from a discussion I had with Jesper where he pointed out
> that zeroing out the fields that were not active comes with a measurable
> performance impact. So changing the struct layout to only include the
> fields that are currently used is a way to make sure we don't hurt
> performance.
> 
> If I'm understanding you correctly, what you propose instead is that we
> just keep the struct layout the same and only write the data that we
> have, leaving the other fields as uninitialised (so essentially
> garbage), right?
> 
> If we do this, the BPF program obviously needs to know which fields are
> valid and which are not. AFAICT you're proposing that this should be
> done out-of-band (i.e., by the system administrator manually ensuring
> BPF program config fits system config)? I think there are a couple of
> problems with this:
> 
> - It requires the system admin to coordinate device config with all of
>   their installed XDP applications. This is error-prone, especially as
>   the number of applications grows (say if different containers have
>   different XDP programs installed on their virtual devices).
> 
> - It has synchronisation issues. Say I have an XDP program with optional
>   support for hardware timestamps and a software fallback. It gets
>   installed in software fallback mode; then the admin has to make sure
>   to enable the hardware timestamps before switching the application
>   into the mode where it will read that metadata field (and the opposite
>   order when disabling the hardware mode).

IMHO this synchronization issue is problematic.  E.g. when turning off
HW-timestamping, userspace BPF-application have to be quick, as it need
to disable BPF-prog global-variable BEFORE hardware stops setting
HW-TS, else BPF-prog will think HW-TS is on and read garbage. (There is
a similar issue for in-flight packets when turning this on).

Today enable/disable of HW-TS happens via a socket API. Do you imagine
the BPF-prog need to catch these events (turning HW-TS off) and then
update the BPF-prog global-variable?

> Also, we need to be able to deal with different metadata layouts on
> different packets in the same program. Consider the XDP program
> running on a veth device inside a container above: if this gets
> packets redirected into it from different NICs with different
> layouts, it needs to be able to figure out which packet came from
> where.
> 
> With this in mind I think we have to encode the metadata format into
> the packet metadata itself somehow. This could just be a matter of
> including the BTF ID as part of the struct itself, so that your
> example could essentially do:
> 
>   if (data->meta_btf_id == timestamp_id) {
>     struct timestamp_meta *meta = data->meta_data;
>     // do stuff
>   } else {
>     struct normal_meta *meta = data->meta_data;
>   }
> 
> 
> and then, to avoid drivers having to define different layouts we could
> essentially have the two metadata structs be:
> 
>  struct normal_meta {
>   u32 rxhash;
>   u32 padding;
>   u8 vlan;
>  };
> 
> and
> 
>  struct timestamp_meta {
>    u32 rxhash;
>    u32 timestamp;
>    u8 vlan;
>  };

This aligns well with my above suggestion to name a member differently
like "padding" in above.

Another way to "remove" members is the change the metadata size.
This way the BPF program cannot access it.  Notice, that is why I had
my timestamp info in the top of the struct in my example.  The driver
code is of-cause simply written such that the offsets are not dynamic (I
hope this is also clear to others, else feel free to poke me to explain
better...).

> This still gets us exponential growth in the number of metadata
> structs, but at least we could create tooling to auto-generate the
> BTF for the variants so the complexity is reduced to just consuming a
> lot of BTF IDs.
> 
> Alternatively we could have an explicit bitmap of valid fields, like:
> 
>  struct all_meta {
>    u32 _valid_field_bitmap;
>    u32 rxhash;
>    u32 timestamp;
>    u8 vlan;
>  };
> 
> and if a program reads all_meta->timestamp CO-RE could transparently
> insert a check of the relevant field in the bitmap first. My immediate
> feeling is that this last solution would be nicer: We'd still need to
> include the packet BTF ID in the packet data to deal with case where
> packets are coming from different interfaces, but we'd avoid having
> lots of different variants with separate BTF IDs. I'm not sure what
> it would take to teach CO-RE to support such a scheme, though...
> 
> WDYT?

Keeping above intact, as I (also) want to hear what John thinks.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer