Re: [LSF/MM/BPF TOPIC] XDP metadata for TX

Stanislav Fomichev <sdf@xxxxxxxxxx> · Thu, 9 Mar 2023 10:04:45 -0800

On Tue, Mar 7, 2023 at 11:32 AM Jesper Dangaard Brouer
<jbrouer@xxxxxxxxxx> wrote:
>
>
> On 03/03/2023 08.42, Magnus Karlsson wrote:
> > On Mon, 27 Feb 2023 at 21:16, Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
> >> On Mon, Feb 27, 2023 at 6:17 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> >>> Stanislav Fomichev <sdf@xxxxxxxxxx> writes:
> >>>> On Thu, Feb 23, 2023 at 3:22 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> >>>>>
> >>>>> Stanislav Fomichev <sdf@xxxxxxxxxx> writes:
> >>>>>
> >>>>>> I'd like to discuss a potential follow up for the previous "XDP RX
> >>>>>> metadata" series [0].
> >>>>>>
> >>>>>> Now that we can access (a subset of) packet metadata at RX, I'd like to
> >>>>>> explore the options where we can export some of that metadata on TX. And
> >>>>>> also whether it might be possible to access some of the TX completion
> >>>>>> metadata (things like TX timestamp).
> >>>>>>
>
> IMHO it makes sense to see TX metadata as two separate operations.
>
>   (1) Metadata written into the TX descriptor.
>   (2) Metadata read when processing TX completion.
>
> These operations happen at two different points in time. Thus likely
> need different BPF hooks.   Having BPF-progs running at each of these
> points in time, will allow us to e.g. implement BQL (which is relevant
> to XDP queuing effort).

I guess for (2) the question here is: is it worth having a separate
hook? Or will a simple traceepoint as Toke suggested be enough? For
BQL purposes, we can still attach a prog to that tracepoint, right?

> >>>>>> I'm currently trying to understand whether the same approach I've used
> >>>>>> on RX could work at TX. By May I plan to have a bunch of options laid
> >>>>>> out (currently considering XSK tx/compl programs and XDP tx/compl
> >>>>>> programs) so we have something to discuss.
> >>>>>
> >>>>> I've been looking at ways of getting a TX-completion hook for the XDP
> >>>>> queueing stuff as well. For that, I think it could work to just hook
> >>>>> into xdp_return_frame(), but if you want to access hardware metadata
> >>>>> it'll obviously have to be in the driver. A hook in the driver could
> >>>>> certainly be used for the queueing return as well, though, which may
> >>>>> help making it worth the trouble :)
> >>>>
> >>>> Yeah, I'd like to get to completion descriptors ideally; so nothing
> >>>> better than a driver hook comes to mind so far :-(
>
> As Toke mentions, I'm also hoping we could leverage or extend the
> xdp_return_frame() call.  Or implicitly add the "hook" at the existing
> xdp_return_frame() call. This is about operation (2) *reading* some
> metadata at TX completion time.

Ack, noted, thx. Although, at least for mlx5e_free_xdpsq_desc, I don't
see it being called for the af_xdp tx path. But maybe that's something
we can amend in a couple of places (so xdp_return_frame would handle
most xdp cases, and some new tbd func for af_xdp tx)?

> Can this be mapped to the RX-kfuncs approach(?), by driver extending
> (call/structs) with pointer to TX-desc + adaptor info and BPF-prog/hook
> doing TX-kfuncs calls into driver (that knows how to extract completion
> data).

Yeah, that seems like a natural thing to do here.

>
> [...]
> >>> Well, to me XDP_REDIRECT is the most interesting one (see above). I
> >>> think we could even drop the XDP_TX case and only do this for
> >>> XDP_REDIRECT, since XDP_TX is basically a special-case optimisation.
> >>> I.e., it's possible to XDP_REDIRECT back to the same device, the frames
> >>> will just take a slight detour up through the stack; but that could also
> >>> be a good thing if it means we'll have to do less surgery to the drivers
> >>> to implement this for two paths.
> >>>
> >>> It does have the same challenge as you outlined above, though: At that
> >>> point the TX descriptor probably doesn't exist, so the driver NDO will
> >>> have to do something else with the data; but maybe we can solve that
> >>> without moving the hook into the driver itself somehow?
> >>
> >> Ah, ok, yeah, I was putting XDP_TX / XDP_REDIRECT under the same
> >> "transmit something out of xdp_rx hook" umbrella. We can maybe come up
> >> with a skb-like-private metadata layout (as we've discussed previously
> >> for skb) here as well? But not sure it would solve all the problems?
>
> This is operation (1) writing metadata into the TX descriptor.
> In this case we have a metadata mapping problem, from RX on one device
> driver to TX on another device driver. As you say, we also need to map
> this SKBs, which have a fairly static set of metadata.
>
> For the most common metadata offloads (like TX-checksum, TX-vlan) I
> think it makes sense to store those in xdp_frame area (use for SKB
> mapping) and re-use these when at TX writing into the TX descriptor.

[..]

> BUT there are also metadata TX offloads offloads, like asking for a
> specific Launch-Time for at packet, that needs a more flexible approach.

Why can't these go into the same "common" xdp_frame area?

> >> I'm thinking of an af_xdp case where it wants to program something
> >> similar to tso/encap/tunneling offload (assuming af_xdp will get 4k+
> >> support) or a checksum offload. Exposing access to the driver tx hooks
> >> seems like the easiest way to get there?
> >>
> >>>> - AF_XDP TX - this one needs something deep in the driver (due to tx
> >>>> zc) to populate the descriptors?
> >>>
> >>> Yeah, this one is a bit more challenging, but having a way to process
> >>> AF_XDP frames in the kernel before they're sent out would be good in any
> >>> case (for things like policing what packets an AF_XDP application can
> >>> send in a cloud deployment, for instance). Would be best if we could
> >>> consolidate the XDP_REDIRECT and AF_XDP paths, I suppose...
> >>>
>
> I agree, it would be best if we can consolidate the XDP_REDIRECT and
> AF_XDP paths, else we have to re-implement the same for AF_XDP xmit path
> (and maintain both paths). I also agree that being able to police what
> packets an AF_XDP application can send is a useful feature (e.g. cloud
> deployments).
>
> Looking forward to collaborate on this work!
> --Jesper

Thank you for the comments! So it looks like two things we potentially
need to do/agree upon:
1. User-facing API. One hook + tracepoint vs two hooks (and at what
level: af_xdp vs xdp). I'll try to focus on that first (waiting for
af_xdp patches that Magnus mentioned).
2. Potentially internal refactoring to consolidate XDP_REDIRECT+AF_XDP
(seems like something we should be able to discuss as we go; aka
implementation details)