Re: [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu Sep 26, 2024 at 2:41 PM CEST, Toke Høiland-Jørgensen wrote:
> Arthur Fabre <afabre@xxxxxxxxxxxxxx> writes:
>
> > On Sun, Sep 22, 2024 at 1:12 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> >> FYI, we also had a discussion related to this at LPC on Friday, in this
> >> session: https://lpc.events/event/18/contributions/1935/
> >>
> >> The context here was that Arthur and Jakub want to also support extended
> >> rich metadata all the way through the SKB path, and are looking at the
> >> same area used for XDP metadata to store it. So there's a need to manage
> >> both the kernel's own usage of that area, and userspace/BPF usage of it.
> >>
> >> I'll try to summarise some of the points of that discussion (all
> >> interpretations are my own, of course):
> >>
> >> - We want something that can be carried with a frame all the way from
> >>   the XDP layer, through all SKB layers and to userspace (to replace the
> >>   use of skb->mark for this purpose).
> >>
> >> - We want different applications running on the system (of which the
> >>   kernel itself if one, cf this discussion) to be able to share this
> >>   field, without having to have an out of band registry (like a Github
> >>   repository where applications can agree on which bits to use). Which
> >>   probably means that the kernel needs to be in the loop somehow to
> >>   explicitly allocate space in the metadata area and track offsets.
> >>
> >> - Having an explicit API to access this from userspace, without having
> >>   to go through BPF (i.e., a socket- or CMSG-based API) would be useful.
> >>
> >
> > Thanks for looping us in, and the great summary Toke!
>
> You're welcome :)
>
> >> The TLV format was one of the suggestions in Arthur and Jakub's talk,
> >> but AFAICT, there was not a lot of enthusiasm about this in the room
> >> (myself included), because of the parsing overhead and complexity. I
> >> believe the alternative that was seen as most favourable was a map
> >> lookup-style API, where applications can request a metadata area of
> >> arbitrary size and get an ID assigned that they can then use to set/get
> >> values in the data path.
> >>
> >> So, sketching this out, this could be realised by something like:
> >>
> >> /* could be called from BPF, or through netlink or sysfs; may fail, if
> >>  * there is no more space
> >>  */
> >> int metadata_id = register_packet_metadata_field(sizeof(struct my_meta));
> >>
> >> The ID is just an opaque identifier that can then be passed to
> >> getter/setter functions (for both SKB and XDP), like:
> >>
> >> ret = bpf_set_packet_metadata_field(pkt, metadata_id,
> >>                                     &my_meta_value, sizeof(my_meta_value))
> >>
> >> ret = bpf_get_packet_metadata_field(pkt, metadata_id,
> >>                                     &my_meta_value, sizeof(my_meta_value))
> >>
> >>
> >> On the kernel side, the implementation would track registered fields in
> >> a global structure somewhere, say:
> >>
> >> struct pkt_metadata_entry {
> >>   int id;
> >>   u8 sz;
> >>   u8 offset;
> >>   u8 bit;
> >> };
> >>
> >> struct pkt_metadata_registry { /* allocated as a system-wide global */
> >>   u8 num_entries;
> >>   u8 total_size;
> >>   struct pkt_metadata_entry entries[MAX_ENTRIES];
> >> };
> >>
> >> struct xdp_rx_meta { /* at then end of xdp_frame */
> >>   u8 sz; /* set to pkt_metadata_registry->total_size on alloc */
> >>   u8 fields_set; /* bitmap of fields that have been set, see below */
> >>   u8 data[];
> >> };
> >>
> >> int register_packet_metadata_field(u8 size) {
> >>   struct pkt_metadata_registry *reg = get_global_registry();
> >>   struct pkt_metadata_entry *entry;
> >>
> >>   if (size + reg->total_size > MAX_METADATA_SIZE)
> >>     return -ENOSPC;
> >>
> >>   entry = &reg->entries[reg->num_entries++];
> >>   entry->id = assign_id();
> >>   entry->sz = size;
> >>   entry->offset = reg->total_size;
> >>   entry->bit = reg->num_entries - 1;
> >>   reg->total_size += size;
> >>
> >>   return entry->id;
> >> }
> >>
> >> int bpf_set_packet_metadata_field(struct xdp_frame *frm, int id, void
> >>                                   *value, size_t sz)
> >> {
> >>   struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);
> >>
> >>   if (!entry)
> >>     return -ENOENT;
> >>
> >>   if (entry->sz != sz)
> >>     return -EINVAL; /* user error */
> >>
> >>   if (frm->rx_meta.sz < entry->offset + sz)
> >>     return -EFAULT; /* entry allocated after xdp_frame was initialised */
> >>
> >>   memcpy(&frm->rx_meta.data + entry->offset, value, sz);
> >>   frm->rx_meta.fields_set |= BIT(entry->bit);
> >>
> >>   return 0;
> >> }
> >>
> >> int bpf_get_packet_metadata_field(struct xdp_frame *frm, int id, void
> >>                                   *value, size_t sz)
> >> {
> >>   struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);
> >>
> >>   if (!entry)
> >>     return -ENOENT;
> >>
> >>   if (entry->sz != sz)
> >>     return -EINVAL;
> >>
> >> if (frm->rx_meta.sz < entry->offset + sz)
> >>     return -EFAULT; /* entry allocated after xdp_frame was initialised */
> >>
> >>   if (!(frm->rx_meta.fields_set & BIT(entry->bit)))
> >>     return -ENOENT;
> >>
> >>   memcpy(value, &frm->rx_meta.data + entry->offset, sz);
> >>
> >>   return 0;
> >> }
> >>
> >> I'm hinting at some complications here (with the EFAULT return) that
> >> needs to be resolved: there is no guarantee that a given packet will be
> >> in sync with the current status of the registered metadata, so we need
> >> explicit checks for this. If metadata entries are de-registered again
> >> this also means dealing with holes and/or reshuffling the metadata
> >> layout to reuse the released space (incidentally, this is the one place
> >> where a TLV format would have advantages).
> >>
> >> The nice thing about an API like this, though, is that it's extensible,
> >> and the kernel itself can be just another consumer of it for the
> >> metadata fields Lorenzo is adding in this series. I.e., we could just
> >> pre-define some IDs for metadata vlan, timestamp etc, and use the same
> >> functions as above from within the kernel to set and get those values;
> >> using the registry, there could even be an option to turn those off if
> >> an application wants more space for its own usage. Or, alternatively, we
> >> could keep the kernel-internal IDs hardcoded and always allocated, and
> >> just use the getter/setter functions as the BPF API for accessing them.
> >
> > That's exactly what I'm thinking of too, a simple API like:
> >
> > get(u8 key, u8 len, void *val);
> > set(u8 key, u8 len, void *val);
> >
> > With "well-known" keys like METADATA_ID_HW_HASH for hardware metadata.
> >
> > If a NIC doesn't support a certain well-known metadata, the key
> > wouldn't be set, and get() would return ENOENT.
> >
> > I think this also lets us avoid having to "register" keys or bits of
> > metadata with the kernel.
> > We'd reserve some number of keys for hardware metadata.
>
> Right, but how do you allocate space/offset for each key without an
> explicit allocation step? You'd basically have to encode the list of IDs
> in the metadata area itself, which implies a TLV format that you have to
> walk on every access? The registry idea in my example above was
> basically to avoid that...

I've been playing around with having a small fixed header at the front
of the metadata itself, that lets you access values without walking them
all.

Still WIP, and maybe this is too restrictive, but it lets you encode 64
2, 4, or 8 byte KVs with a single 16 byte header:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <errno.h>
#include <string.h>

/**
 * Very limited KV support:
 *
 * - Keys: 0-63. (TBD which are reserved for the kernel)
 * - Value size: 2, 4, or 8.
 * 
 * But compact, uses a fixed 16 byte header only.
 * Reasonably fast access.
 * Insertion and deletion need a memmove (entries have to be sorted), but it's ~100s of bytes of max.
 *
 * TODO - should we support more sizes? Can switch to three bits per entry.
 * TODO - should we support a 0 size to just the presence / absence of key without a value?
 */
bool valid_len(uint8_t len) {
    return len == 2 || len == 4 || len == 8;
}

bool valid_key(uint8_t key) {
    return key < 64;
}

// Fixed header at the start of the meta area.
struct hdr {
    // 2 bit length is stored for each key:
    //  a high bit in the high word.
    //  a low bit in the low word.
    // Key itself is the bit position in each word, LSb 0.
    // This lets us count the bits in high and low to easily
    // calculate the sum of the preceeding KVs lengths.
    uint64_t high;
    uint64_t low;
};

int total_length(struct hdr h) {
    // TODO - is this builtin allowed in kernel code?
    return (__builtin_popcountll(h.high) << 2) + (__builtin_popcountll(h.low) << 1);
}

struct hdr and(struct hdr h, uint64_t mask) {
    return (struct hdr){
        h.high & mask,
        h.low & mask,
    };
}

int offset(struct hdr h, uint8_t key) {
    // Calculate total length of previous keys by masking out keys after.
    return sizeof(struct hdr) + total_length(and(h, ~(~0llu << key)));
}

// TODO - is headroom zero initialized?
#define META_LEN (sizeof(struct hdr) + 128)
uint8_t meta[META_LEN];

int set(uint8_t key, uint8_t len, void *val) {
    if (!valid_key(key)) {
        return -EINVAL;
    }
    if (!valid_len(len)) {
        return -EINVAL;
    }

    struct hdr *h = (struct hdr *)meta;

    // Figure out if we have enough room left: total length of everything now.
    if (sizeof(struct hdr) + total_length(*h) + len > sizeof(meta)) {
        return -ENOMEM;
    }

    // Offset of value of this key.
    int off = offset(*h, key);

    // Memmove all the kvs after us over.
    memmove(meta+off+len, meta+off, sizeof(meta)-off);

    // Set our value.
    memcpy(meta+off, val, len);

    // Store our length in header.
    uint64_t encode_len = 0;
    switch (len) {
    case 2:
        encode_len = 1;
        break;
    case 4:
        encode_len = 2;
        break;
    case 8:
        encode_len = 3;
        break;
    }
    h->high |= (encode_len >> 1) << key;
    h->low |= (encode_len & 1) << key;

    return 0;
}

// Callers need to know the format ahead of time,
// so they'll know the length too.
// We just check the buffer is big enough for the value.
int get(uint8_t key, uint8_t len, void *val) {
    if (!valid_key(key)) {
        return -EINVAL;
    }
    if (!valid_len(len)) {
        return -EINVAL;
    }

    struct hdr h = *(struct hdr *)meta;

    // Check key is set.
    if (!((h.high & (1ull << key)) || (h.low & (1ull << key)))) {
        return -ENOENT;
    }

    // Offset of value of this key.
    int off = offset(h, key);

    // Figure out our length.
    int real_len = total_length(and(h, (1ull << key)));

    if (real_len > len) {
        return -EFBIG;
    }

    memcpy(val, meta+off, real_len);
    return 0;
}

int del(uint8_t key) {
    if (!valid_key(key)) {
        return -EINVAL;
    }

    struct hdr *h = (struct hdr *)meta;

    // Check key is set.
    if (!((h->high & (1ull << key)) || (h->low & (1ull << key)))) {
        return -ENOENT;
    }

    // Offset and length of value of this key.
    int off = offset(*h, key);
    int len = total_length(and(*h, (1ull << key)));

    // Memmove all the kvs after us over.
    memmove(meta+off, meta+off+len, sizeof(meta)-off-len);

    // Clear our length in header.
    h->high &= ~(1ull << key);
    h->low &= ~(1ull << key);
    return 0;
}

>
> > The remaining keys would be up to users. They'd have to allocate keys
> > to services, and configure services to use those keys.
> > This is similar to the way listening on a certain port works: only one
> > service can use port 80 or 443, and that can typically beconfigured in
> > a service's config file.
>
> Right, well, port numbers *do* actually have an out of band service
> registry (IANA), which I thought was what we wanted to avoid? ;)

Depends how you think about it ;)

I think we should avoid a global registry. But having a registry per
deployment / server doesn't seem awful. Services that want to use a
field would have a config file setting to set which index it corresponds
to.
Admins would just have to pick a free one on their system, and set it in
the config file of the service.

This is similar to what we do for non-IANA registered ports internally.
For example each service needs a port on an internal interface to expose
metrics, and we just track which ports are taken in our config
management.

Dynamically registering fields means you have to share the returned ID
with whoever is interested, which sounds tricky.
If an XDP program sets a field like packet_id, every tracing
program that looks at it, and userspace service, would need to know what
the ID of that field is.
Is there a way to easily share that ID with all of them?

>
> > This side-steps the whole question of how to change the registered
> > metadata for in-flight packets, and how to deal with different NICs
> > with different hardware metadata.
> >
> > I think I've figured out a suitable encoding format, hopefully we'll
> > have an RFC soon!
>
> Alright, cool!
>
> -Toke






[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux