Re: [PATCH bpf-next v4 2/3] bpf: Add xdp dynptrs

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Thu, 25 Aug 2022 14:53:02 -0700

On Wed, Aug 24, 2022 at 4:04 PM Kumar Kartikeya Dwivedi
<memxor@xxxxxxxxx> wrote:
>
> On Wed, 24 Aug 2022 at 20:11, Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
> >
> > On Wed, Aug 24, 2022 at 3:39 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> > >
> > > Joanne Koong <joannelkoong@xxxxxxxxx> writes:
> > > >> [...]
> > > >> It's a bit awkward to have such a difference between xdp and skb
> > > >> dynptr's read/write. I understand why it is the way it is, but it
> > > >> still doesn't feel right. I'm not sure if we can reconcile the
> > > >> differences, but it makes writing common code for both xdp and tc
> > > >> harder as it needs to be aware of the differences (and then the flags
> > > >> for dynptr_write would differ too). So we're 90% there but not the
> > > >> whole way...
> > > >
> > > > Yeah, it'd be great if the behavior for skb/xdp progs could be the
> > > > same, but I'm not seeing a better solution here (unless we invalidate
> > > > data slices on writes in xdp progs, just to make it match more :P).
> > > >
> > > > Regarding having 2 different interfaces bpf_dynptr_from_{skb/xdp}, I'm
> > > > not convinced this is much of a problem - xdp and skb programs already
> > > > have different interfaces for doing things (eg
> > > > bpf_{skb/xdp}_{store/load}_bytes).
> > >
> > > This is true, but it's quite possible to paper over these differences
> > > and write BPF code that works for both TC and XDP. Subtle semantic
> > > differences in otherwise identical functions makes this harder.
> > >
> > > Today you can write a function like:
> > >
> > > static inline int parse_pkt(void *data, void* data_end)
> > > {
> > >         /* parse data */
> > > }
> > >
> > > And call it like:
> > >
> > > SEC("xdp")
> > > int parse_xdp(struct xdp_md *ctx)
> > > {
> > >         return parse_pkt(ctx->data, ctx->data_end);
> > > }
> > >
> > > SEC("tc")
> > > int parse_tc(struct __sk_buff *skb)
> > > {
> > >         return parse_pkt(skb->data, skb->data_end);
> > > }
> > >
> > >
> > > IMO the goal should be to be able to do the equivalent for dynptrs, like:
> > >
> > > static inline int parse_pkt(struct bpf_dynptr *ptr)
> > > {
> > >         __u64 *data;
> > >
> > >         data = bpf_dynptr_data(ptr, 0, sizeof(*data));
> > >         if (!data)
> > >                 return 0;
> > >         /* parse data */
> > > }
> > >
> > > SEC("xdp")
> > > int parse_xdp(struct xdp_md *ctx)
> > > {
> > >         struct bpf_dynptr ptr;
> > >
> > >         bpf_dynptr_from_xdp(ctx, 0, &ptr);
> > >         return parse_pkt(&ptr);
> > > }
> > >
> > > SEC("tc")
> > > int parse_tc(struct __sk_buff *skb)
> > > {
> > >         struct bpf_dynptr ptr;
> > >
> > >         bpf_dynptr_from_skb(skb, 0, &ptr);
> > >         return parse_pkt(&ptr);
> > > }
> > >
> >
> > To clarify, this is already possible when using data slices, since the
> > behavior for data slices is equivalent between xdp and tc programs for
> > non-fragmented accesses. From looking through the selftests, I
> > anticipate that data slices will be the main way programs interact
> > with dynptrs. For the cases where the program may write into frags,
> > then bpf_dynptr_write will be needed (which is where functionality
> > between xdp and tc start differing) - today, we're not able to write
> > common code that writes into the frags since tc uses
> > bpf_skb_store_bytes and xdp uses bpf_xdp_store_bytes.
>
> Yes, we cannot write code that just swaps those two calls right now.
> The key difference is, the two above are separate functions already
> and have different semantics. Here in this set you have the same
> function, but with different semantics depending on ctx, which is the
> point of contention.

bpf_dynptr_{read,write}() are effectively virtual methods of dynptr
and we have few different implementations of dynptr, depending on what
data they are wrapping, so having different semantics depending on ctx
makes sense, if they are dictated by good reasons (like good
performance for skb and xdp).

>
> >
> > I'm more and more liking the idea of limiting xdp to match the
> > constraints of skb given that both you, Kumar, and Jakub have
> > mentioned that portability between xdp and skb would be useful for
> > users :)
> >
> > What are your thoughts on this API:
> >
> > 1) bpf_dynptr_data()
> >
> > Before:
> >   for skb-type progs:
> >       - data slices in fragments is not supported
> >
> >   for xdp-type progs:
> >       - data slices in fragments is supported as long as it is in a
> > contiguous frag (eg not across frags)
> >
> > Now:
> >   for skb + xdp type progs:
> >       - data slices in fragments is not supported
> >
> >
> > 2)  bpf_dynptr_write()
> >
> > Before:
> >   for skb-type progs:
> >      - all data slices are invalidated after a write
> >
> >   for xdp-type progs:
> >      - nothing
> >
> > Now:
> >   for skb + xdp type progs:
> >      - all data slices are invalidated after a write
> >
>
> There is also the other option: failing to write until you pull skb,
> which looks a lot better to me if we are adding this helper anyway...

Wouldn't this kill performance for typical cases when writes don't go
into frags?

>
> > This will unite the functionality for skb and xdp programs across
> > bpf_dyntpr_data, bpf_dynptr_write, and bpf_dynptr_read. As for whether
> > we should unite bpf_dynptr_from_skb and bpf_dynptr_from_xdp into one
> > common bpf_dynptr_from_packet as Jakub brought in [0], I'm leaning
> > towards no because 1) if in the future there's some irreconcilable
> > aspect between skb and xdp that gets added, that'll be hard to support
> > since the expectation is that there is just one overall "packet
> > dynptr" 2) the "packet dynptr" view is not completely accurate (eg
> > bpf_dynptr_write accepts flags from skb progs and not xdp progs) 3)
> > this adds some additional hardcoding in the verifier since there's no
> > organic mapping between prog type -> prog ctx
> >
>
> There is, e.g. see how btf_get_prog_ctx_type is doing it (unless I
> misunderstood what you meant).
>
> I also had a different tangential question (unrelated to
> 'reconciliation'). Let's forget about that for a moment. I was
> listening to the talk here [0]. At 12:00, I can hear someone
> mentioning that you'd have a dynptr for each segment/frag.

One of those people was me :) I think what you are referring to was an
idea that for frags we might end a special iterator function that will
call a callback for each linear frag. And in such case linear frag
will be presented to callback as dynptr (just as an interface to
statically unknown memory region, it will be DYNPTR_LOCAL at that
point, probably).

It's different from DYNPTR_SKB that Joanne is adding here, though.

>
> Right now, you have a skb/xdp dynptr, which is representing
> discontiguous memory, i.e. you represent all linear page + fragments
> using one dynptr. This seems a bit inconsistent with the idea of a
> dynptr, since it's conceptually a slice of variable length memory.
> dynptr_data gives you a constant length slice into dynptr's variable
> length memory (which the verifier can track bounds of statically).

All the data in skb is conceptually a slice of variable length memory
with no gaps in between. It's just physically discontiguous, but there
is still a linear offset addressing (conceptually). So I think it
makes sense to represent skb data as a whole as one SKB dynptr. There
are technical restrictions on direct memory access through
bpf_dynptr_data(), inevitablly.

>
> So thinking about the idea in [0], one way would be to change
> from_xdp/from_skb helpers to take an index (into the frags array), and
> return dynptr for each frag. Or determine frag from offset + len. For
> the skb case it might require kmap_atomic to access the frag dynptr,
> which might need a paired kunmap_atomic call as well, and it may not
> return dynptr for frags in cloned skbs, but for XDP multi-buff all
> that doesn't count.

We could do something like bpf_dynptr_from_skb_frag(skb, N) and get
access to frag #N (where we can special case 0 being linear skb data),
but that would be DYNPTR_LOCAL at that point, probably? And it's a
separate extension in addition to DYNPTR_SKB. Having generic
bpf_dynptr_{read,write}() working over entire skb range is very
useful, while this per-frag dynptr would prevent this. So let's keep
those two separate.

>
> Was this approach/line of thinking considered and/or rejected? What
> were the reasons? Just trying to understand the background here.
>
> What I'm wondering is that we already have helpers to do reads and
> writes that you are wrapping over in dynptr interfaces, but what we're
> missing is to be able to get direct access to frags (or 'next' ctx,
> essentially). When we know we can avoid pulls, it might be cheaper to
> then do access this way to skb, right? Especially for a case where
> just a part of the header lies in the next frag?
>
> But I'm no expert here, so I might be missing some subtleties.
>
>   [0]: https://www.youtube.com/watch?v=EZ5PDrXvs7M
>
>
>
>
>
>
>
>
>
> >
> > [0] https://lore.kernel.org/bpf/20220726184706.954822-1-joannelkoong@xxxxxxxxx/T/#m1438f89152b1d0e539fe60a9376482bbc9de7b6e
> >
> > >
> > > If the dynptr-based parse_pkt() function has to take special care to
> > > figure out where the dynptr comes from, it makes it a lot more difficult
> > > to write reusable packet parsing functions. So I'd be in favour of
> > > restricting the dynptr interface to the lowest common denominator of the
> > > skb and xdp interfaces even if that makes things slightly more awkward
> > > in the specialised cases...
> > >
> > > -Toke
> > >