Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN

Jesper Dangaard Brouer <jbrouer@xxxxxxxxxx> · Sat, 24 Jun 2023 11:02:26 +0200

On 23/06/2023 19.41, Stanislav Fomichev wrote:
On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer
<jbrouer@xxxxxxxxxx> wrote:

On 22/06/2023 19.55, Stanislav Fomichev wrote:
On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@xxxxxxxxxx> wrote:

This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc)

On 21/06/2023 19.02, Stanislav Fomichev wrote:
For zerocopy mode, tx_desc->addr can point to the arbitrary offset
and carry some TX metadata in the headroom. For copy mode, there
is no way currently to populate skb metadata.

Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
to treat as metadata. Metadata bytes come prior to tx_desc address
(same as in RX case).

   From looking at the code, this introduces a socket option for this TX
metadata length (tx_metadata_len).
This implies the same fixed TX metadata size is used for all packets.
Maybe describe this in patch desc.

I was planning to do a proper documentation page once we settle on all
the details (similar to the one we have for rx).

What is the plan for dealing with cases that doesn't populate same/full
TX metadata size ?

Do we need to support that? I was assuming that the TX layout would be
fixed between the userspace and BPF.

I hope you don't mean fixed layout, as the whole point is adding
flexibility and extensibility.

I do mean a fixed layout between the userspace (af_xdp) and devtx program.
At least fixed max size of the metadata. The userspace and the bpf
prog can then use this fixed space to implement some flexibility
(btf_ids, versioned structs, bitmasks, tlv, etc).
If we were to make the metalen vary per packet, we'd have to signal
its size per packet. Probably not worth it?

Existing XDP metadata implementation also expand in a fixed/limited
sized memory area, but communicate size per packet in this area (also
for validation purposes).  BUT for AF_XDP we don't have room for another
pointer or size in the AF_XDP descriptor (see struct xdp_desc).

If every packet would have a different metadata length, it seems like
a nightmare to parse?

No parsing is really needed.  We can simply use BTF IDs and type cast in
BPF-prog. Both BPF-prog and userspace have access to the local BTF ids,
see [1] and [2].

It seems we are talking slightly past each-other(?).  Let me rephrase
and reframe the question, what is your *plan* for dealing with different
*types* of TX metadata.  The different struct *types* will of-cause have
different sizes, but that is okay as long as they fit into the maximum
size set by this new socket option XDP_TX_METADATA_LEN.
Thus, in principle I'm fine with XSK having configured a fixed headroom
for metadata, but we need a plan for handling more than one type and
perhaps a xsk desc indicator/flag for knowing TX metadata isn't random
data ("leftover" since last time this mem was used).

Yeah, I think the above correctly catches my expectation here. Some
headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is
offloaded to the bpf program via btf_id/tlv/etc.

Regarding leftover metadata: can we assume the userspace will take
care of setting it up?

With this kfunc approach, then things in-principle, becomes a contract
between the "local" TX-hook BPF-prog and AF_XDP userspace.   These two
components can as illustrated here [1]+[2] can coordinate based on local
BPF-prog BTF IDs.  This approach works as-is today, but patchset
selftests examples don't use this and instead have a very static
approach (that people will copy-paste).

An unsolved problem with TX-hook is that it can also get packets from
XDP_REDIRECT and even normal SKBs gets processed (right?).  How does the
BPF-prog know if metadata is valid and intended to be used for e.g.
requesting the timestamp? (imagine metadata size happen to match)

My assumption was the bpf program can do ifindex/netns filtering. Plus
maybe check that the meta_len is the one that's expected.
Will that be enough to handle XDP_REDIRECT?

I don't think so, using the meta_len (+ ifindex/netns) to communicate
activation of TX hardware hints is too weak and not enough.  This is an
implicit API for BPF-programmers to understand and can lead to implicit
activation.

Think about what will happen for your AF_XDP send use-case.  For
performance reasons AF_XDP don't zero out frame memory.  Thus, meta_len
is fixed even if not used (and can contain garbage), it can by accident
create hard-to-debug situations.  As discussed with Magnus+Maryam
before, we found it was practical (and faster than mem zero) to extend
AF_XDP descriptor (see struct xdp_desc) with some flags to
indicate/communicate this frame comes with TX metadata hints.

BPF-prog API bpf_core_type_id_local:
   - [1]
https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L80

Userspace API btf__find_by_name_kind:
   - [2]
https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/lib_xsk_extend.c#L185