Re: [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT

Lorenzo Bianconi <lorenzo.bianconi@xxxxxxxxxx> writes:

>> 
>> 
>> On 21/09/2024 22.17, Alexander Lobakin wrote:
>> > From: Lorenzo Bianconi <lorenzo@xxxxxxxxxx>
>> > Date: Sat, 21 Sep 2024 18:52:56 +0200
>> > 
>> > > This series introduces the xdp_rx_meta struct in the xdp_buff/xdp_frame
>> > 
>> > &xdp_buff is on the stack.
>> > &xdp_frame consumes headroom.
>> > 
>> > IOW they're size-sensitive and putting metadata directly there might
>> > play bad; if not now, then later.
>> > 
>> > Our idea (me + Toke) was as follows:
>> > 
>> > - new BPF kfunc to build generic meta. If called, the driver builds a
>> >    generic meta with hash, csum etc., in the data_meta area.
>> 
>> I do agree that it should be the XDP prog (via a new BPF kfunc) that
>> decides if the xdp_frame should be updated to contain a generic meta struct.
>> *BUT* I think we should use the xdp_frame area, and not the
>> xdp->data_meta area.
>
> ack, I will add a new kfunc for it.
>
>> 
>> A detail is that I think this kfunc should write data directly into the
>> xdp_frame area, even when we are only operating on the xdp_buff, as we
>> do have access to the area the xdp_frame will be created in.
>
> this would avoid copying it when we convert from xdp_buff to xdp_frame, nice :)
>
>> 
>> 
>> When using the data_meta area, netstack encap/decap needs to move the
>> data_meta area (extra cycles).  The xdp_frame area (which lives at the top)
>> doesn't have this issue.
>> 
>> It is easier to allow the xdp_frame area to survive longer together with the
>> SKB. Today we "release" this xdp_frame area to be used by SKB for extra
>> headroom (see xdp_scrub_frame).  I can imagine that we can move SKB
>> fields to this area, and reduce the size of the SKB alloc. (This then
>> becomes the mini-SKB we discussed a couple of years ago).
>> 
>> 
>> >    Yes, this also consumes headroom, but only when the corresponding func
>> >    is called. Introducing new fields like you're doing will consume it
>> >    unconditionally;
>> 
>> We agree that the kfunc call marks the area as consumed/in-use.  We can extend
>> xdp_frame statically like Lorenzo does (with struct xdp_rx_meta), but
>> xdp_frame->flags can be used for marking this area as used or not.
>
> the only downside with respect to a TLV approach would be consuming the whole
> xdp_rx_meta area as soon as we set a single xdp rx hw hint in it, right?
> The upside is that it is easier and requires fewer instructions.

FYI, we also had a discussion related to this at LPC on Friday, in this
session: https://lpc.events/event/18/contributions/1935/

The context here was that Arthur and Jakub also want to support extended
rich metadata all the way through the SKB path, and are looking at using
the same area as XDP metadata to store it. So there's a need to manage
both the kernel's own usage of that area, and userspace/BPF usage of it.

I'll try to summarise some of the points of that discussion (all
interpretations are my own, of course):

- We want something that can be carried with a frame all the way from
  the XDP layer, through all SKB layers and to userspace (to replace the
  use of skb->mark for this purpose).

- We want different applications running on the system (of which the
  kernel itself is one, cf this discussion) to be able to share this
  field, without having to have an out of band registry (like a Github
  repository where applications can agree on which bits to use). Which
  probably means that the kernel needs to be in the loop somehow to
  explicitly allocate space in the metadata area and track offsets.

- Having an explicit API to access this from userspace, without having
  to go through BPF (i.e., a socket- or CMSG-based API) would be useful;
  a rough sketch of what this could look like is included just below.
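
(As a rough illustration of that last point: assuming a hypothetical
SCM_PKT_METADATA cmsg type that carries the registered metadata area as
ancillary data, the userspace side could look something like the sketch
below. Nothing like this exists today; the constant and payload layout
are made up purely to show the shape of such an API.)

#include <sys/types.h>
#include <sys/socket.h>
#include <string.h>

#define SCM_PKT_METADATA 0x50 /* hypothetical, does not exist today */

struct my_meta { /* whatever layout the application registered */
  unsigned int flow_id;
};

static ssize_t recv_with_meta(int fd, void *buf, size_t len,
                              struct my_meta *meta)
{
  char cbuf[CMSG_SPACE(sizeof(*meta))];
  struct iovec iov = { .iov_base = buf, .iov_len = len };
  struct msghdr msg = {
    .msg_iov = &iov, .msg_iovlen = 1,
    .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
  };
  struct cmsghdr *cmsg;
  ssize_t ret = recvmsg(fd, &msg, 0);

  if (ret < 0)
    return ret;

  /* pick the metadata out of the ancillary data, if present */
  for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
    if (cmsg->cmsg_level == SOL_SOCKET &&
        cmsg->cmsg_type == SCM_PKT_METADATA)
      memcpy(meta, CMSG_DATA(cmsg), sizeof(*meta));
  }

  return ret;
}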


The TLV format was one of the suggestions in Arthur and Jakub's talk,
but AFAICT, there was not a lot of enthusiasm about this in the room
(myself included), because of the parsing overhead and complexity. I
believe the alternative that was seen as most favourable was a map
lookup-style API, where applications can request a metadata area of
arbitrary size and get an ID assigned that they can then use to set/get
values in the data path.

So, sketching this out, this could be realised by something like:

/* could be called from BPF, or through netlink or sysfs; may fail if
 * there is no more space
 */
int metadata_id = register_packet_metadata_field(sizeof(struct my_meta));

The ID is just an opaque identifier that can then be passed to
getter/setter functions (for both SKB and XDP), like:

ret = bpf_set_packet_metadata_field(pkt, metadata_id,
                                    &my_meta_value, sizeof(my_meta_value))

ret = bpf_get_packet_metadata_field(pkt, metadata_id,
                                    &my_meta_value, sizeof(my_meta_value))
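
From the BPF side, usage in an XDP program could then look something
like the sketch below. The kfunc declaration and the way the registered
id reaches the program are hypothetical, and I'm hand-waving whether the
BPF-facing variant takes the xdp_md ctx or the xdp_frame:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct my_meta {
  __u32 flow_id;
  __u16 out_port;
};

/* hypothetical kfunc, mirroring the setter sketched above */
extern int bpf_set_packet_metadata_field(struct xdp_md *ctx, int id,
                                         void *value, __u32 sz) __ksym;

/* filled in by the loader after register_packet_metadata_field() */
int my_meta_id;

SEC("xdp")
int tag_flow(struct xdp_md *ctx)
{
  struct my_meta meta = { .flow_id = 42, .out_port = 3 };

  /* stored in the xdp_frame area, so it survives into the skb */
  if (bpf_set_packet_metadata_field(ctx, my_meta_id, &meta, sizeof(meta)))
    return XDP_ABORTED;

  return XDP_PASS;
}

char _license[] SEC("license") = "GPL";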


On the kernel side, the implementation would track registered fields in
a global structure somewhere, say:

struct pkt_metadata_entry {
  int id;
  u8 sz;
  u8 offset;
  u8 bit;
};

struct pkt_metadata_registry { /* allocated as a system-wide global */
  u8 num_entries;
  u8 total_size;
  struct pkt_metadata_entry entries[MAX_ENTRIES];
};

struct xdp_rx_meta { /* at the end of xdp_frame */
  u8 sz; /* set to pkt_metadata_registry->total_size on alloc */
  u8 fields_set; /* bitmap of fields that have been set, see below */
  u8 data[];
};

int register_packet_metadata_field(u8 size) {
  struct pkt_metadata_registry *reg = get_global_registry();
  struct pkt_metadata_entry *entry;

  if (size + reg->total_size > MAX_METADATA_SIZE)
    return -ENOSPC;

  entry = &reg->entries[reg->num_entries++];
  entry->id = assign_id();
  entry->sz = size;
  entry->offset = reg->total_size;
  entry->bit = reg->num_entries - 1;
  reg->total_size += size;

  return entry->id;
}

int bpf_set_packet_metadata_field(struct xdp_frame *frm, int id, void
                                  *value, size_t sz)
{
  struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);

  if (!entry)
    return -ENOENT;

  if (entry->sz != sz)
    return -EINVAL; /* user error */

  if (frm->rx_meta.sz < entry->offset + sz)
    return -EFAULT; /* entry allocated after xdp_frame was initialised */

  memcpy(frm->rx_meta.data + entry->offset, value, sz);
  frm->rx_meta.fields_set |= BIT(entry->bit);

  return 0;
}

int bpf_get_packet_metadata_field(struct xdp_frame *frm, int id, void
                                  *value, size_t sz)
{
  struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);

  if (!entry)
    return -ENOENT;

  if (entry->sz != sz)
    return -EINVAL;

  if (frm->rx_meta.sz < entry->offset + sz)
    return -EFAULT; /* entry allocated after xdp_frame was initialised */

  if (!(frm->rx_meta.fields_set & BIT(entry->bit)))
    return -ENOENT;

  memcpy(value, frm->rx_meta.data + entry->offset, sz);

  return 0;
}

I'm hinting at some complications here (with the EFAULT return) that
need to be resolved: there is no guarantee that a given packet will be
in sync with the current status of the registered metadata, so we need
explicit checks for this. If metadata entries are de-registered again
this also means dealing with holes and/or reshuffling the metadata
layout to reuse the released space (incidentally, this is the one place
where a TLV format would have advantages).
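
For completeness, a de-registration function that compacts the layout
could look something like the sketch below; again just a sketch, and it
glosses over locking, reclaiming the bit, and the fact that in-flight
frames still carry the old layout (which is exactly what the per-frame
size check above is guarding against):

int unregister_packet_metadata_field(int id)
{
  struct pkt_metadata_registry *reg = get_global_registry();
  struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);
  u8 sz;
  int i, idx;

  if (!entry)
    return -ENOENT;

  sz = entry->sz;
  idx = entry - reg->entries;

  /* close the hole by shifting the following entries down */
  for (i = idx; i < reg->num_entries - 1; i++) {
    reg->entries[i] = reg->entries[i + 1];
    reg->entries[i].offset -= sz;
  }

  reg->num_entries--;
  reg->total_size -= sz;

  return 0;
}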

The nice thing about an API like this, though, is that it's extensible,
and the kernel itself can be just another consumer of it for the
metadata fields Lorenzo is adding in this series. I.e., we could just
pre-define some IDs for metadata like vlan, timestamp, etc., and use the same
functions as above from within the kernel to set and get those values;
using the registry, there could even be an option to turn those off if
an application wants more space for its own usage. Or, alternatively, we
could keep the kernel-internal IDs hardcoded and always allocated, and
just use the getter/setter functions as the BPF API for accessing them.
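
To make that last variant concrete, the hardcoded kernel-internal IDs
could be something like the sketch below (names made up), with drivers,
or the generic kfunc from Lorenzo's series, filling them in through the
same setter:

/* hypothetical reserved IDs for the kernel's own hints */
enum pkt_metadata_kernel_ids {
  PKT_META_ID_RX_HASH = 0,
  PKT_META_ID_RX_TIMESTAMP,
  PKT_META_ID_VLAN_TCI,
  __PKT_META_ID_FIRST_DYNAMIC, /* dynamic registrations start here */
};

static void driver_fill_rx_hints(struct xdp_frame *frm, u32 hash, u16 vlan_tci)
{
  /* same setter as the BPF-facing one above */
  bpf_set_packet_metadata_field(frm, PKT_META_ID_RX_HASH,
                                &hash, sizeof(hash));
  bpf_set_packet_metadata_field(frm, PKT_META_ID_VLAN_TCI,
                                &vlan_tci, sizeof(vlan_tci));
}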

-Toke




