Re: [LSF/MM/BPF TOPIC] bpf qdisc

Amery Hung <ameryhung@xxxxxxxxx> · Fri, 1 Mar 2024 15:30:08 -0800

On Fri, Mar 1, 2024 at 12:08 PM Kumar Kartikeya Dwivedi
<memxor@xxxxxxxxx> wrote:
>
> On Fri, 1 Mar 2024 at 20:28, Amery Hung <ameryhung@xxxxxxxxx> wrote:
> >
> > On Fri, Mar 1, 2024 at 7:06 AM Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote:
> > >
> > > On Fri, 1 Mar 2024 at 16:01, Amery Hung <ameryhung@xxxxxxxxx> wrote:
> > > >
> > > > On Fri, Mar 1, 2024 at 6:08 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> > > > >
> > > > > Amery Hung <ameryhung@xxxxxxxxx> writes:
> > > > >
> > > > > > On Wed, Feb 28, 2024 at 6:36 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> > > > > >>
> > > > > >> Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> writes:
> > > > > >>
> > > > > >> > On Mon, 26 Feb 2024 at 19:04, Amery Hung <ameryhung@xxxxxxxxx> wrote:
> > > > > >> >>
> > > > > >> >> Hi all,
> > > > > >> >>
> > > > > >> >> I would like to discuss bpf qdisc in the BPF track. As we now try to
> > > > > >> >> support bpf qdisc using struct_ops, we found some limitations of
> > > > > >> >> bpf/struct_ops. While some have been discussed briefly on the mailing
> > > > > >> >> list, we can discuss in more detail to make struct_ops a more
> > > > > >> >> generic/palatable approach to replace kernel functions.
> > > > > >> >>
> > > > > >> >> In addition, I would like to discuss supporting adding kernel objects
> > > > > >> >> to bpf_list/rbtree, which may have performance benefits in some
> > > > > >> >> applications and can improve the programming experience. The current
> > > > > >> >> bpf fq in the RFC has a 6% throughput loss compared to the native
> > > > > >> >> counterpart due to memory allocation in enqueue() to store skb kptr.
> > > > > >> >> With a POC I wrote that allows adding skb to bpf_list, the throughput
> > > > > >> >> becomes comparable. We can discuss the approach and other potential
> > > > > >> >> use cases.
> > > > > >> >>
> > > > > >> >
> > > > > >> > When discussing this with Toke (Cc'd) long ago for the XDP queueing
> > > > > >> > patch set, we discussed the same thing, in that the sk_buff already
> > > > > >> > has space for a list or rbnode due to it getting queued in other
> > > > > >> > layers (TCP OoO queue, qdiscs, etc.) so it would make sense to teach
> > > > > >> > the verifier that it is a valid bpf_list_node and bpf_rb_node and
> > > > > >> > allow inserting it as an element into a BPF list or rbtree. Back then
> > > > > >> > we didn't add that as the posting only used the PIFO map.
> > > > > >> >
> > > > > >> > I think not only sk_buff, you can do a similar thing with xdp_buff as
> > > > > >> > well.
> > > > > >>
> > > > > >> Yeah, I agree that allowing skbs to be inserted directly into a BPF
> > > > > >> rbtree would make a lot of sense if it can be done safely. I am less
> > > > > >> sure about xdp_frames, mostly for performance reasons, but if it does
> > > > > >> turn out to be useful whichever mechanism we add for skbs should be
> > > > > >> fairly straight forward to reuse.
> > > > > >>
> > > > > >> > The verifier side changes should be fairly minimal, just allowing the
> > > > > >> > use of a known kernel type as the contained object in a list or
> > > > > >> > rbtree, and the field pointing to this allowlisted list or rbnode.
> > > > > >>
> > > > > >> I think one additional concern here is how we ensure that an skb has
> > > > > >> been correctly removed from any rbtrees it sits in before it is being
> > > > > >> transmitted to another part of the stack?
> > > > > >
> > > > > > I think one solution is to disallow shared ownership of skb in
> > > > > > multiple lists or rbtrees. That is, users should not be able to
> > > > > > acquire the reference of an skb from the ctx more than once in
> > > > > > ".enqueue" or using bpf_refcount_acquire().
> > > > >
> > > > > Can the verifier enforce this, even across multiple enqueue/dequeue
> > > > > calls? Not sure if acquiring a refcount ensures that the rbtree entry
> > > > > has been cleared?
> > > > >
> > > > > Basically, I'm worried about a dequeue() op that does something like:
> > > > >
> > > > > skb = rbtree_head();
> > > > > // skb->rbnode is not cleared
> > > > > return skb; // stack will keep processing it
> > > > >
> > > > > I'm a little fuzzy on how the bpf rbtree stuff works, though, so maybe
> > > > > the verifier is already ensuring that a node cannot be read from a tree
> > > > > without being properly cleared from it?
> > > > >
> > > >
> > > > I see what you are saying now, and thanks Kumar for the clarification!
> > > >
> > > > I was thinking about how to prevent an skb from being added to lists
> > > > and rbtrees at the same time, since list and rbnode share the same
> > > > space. Hence the suggestion.
> > > >
> > >
> > > In BPF qdisc programs, you could teach the verifier that the skb has
> > > reference semantics (ref_obj_id > 0),
> > > in such a case once you push it into a list or rbtree, the program
> > > will lose ownership of the skb and all pointers same as the skb will
> > > be marked invalid.
> > > You could use some peek helper to look at it, but will never have an
> > > skb with program ownership until you pop it back from a list or
> > > rbtree.
> > >
> >
> > This part makes sense. In the enqueue() op of bpf qdisc, I use a kfunc
> > to acquire an skb kptr (ref_obj_id > 0) from the skb in ctx for now.
> > Martin suggested tracking reads from ctx and assigning ref_obj_id.
> >
> > However, either way, if users can do this multiple times in one
> > enqueue() call like below, they can acquire multiple references to the
> > same skb and put them on different lists/rbtrees. This is what I'd
> > like to avoid.
> >
> > SEC("struct_ops/bpf_fifo_enqueue")
> > int BPF_PROG(bpf_fifo_enqueue, struct sk_buff *skb, struct Qdisc *sch,
> > struct bpf_sk_buff_ptr *to_free)
> > {
> >         ...
> >         skb_kptr_a = bpf_skb_acquire(skb);
> >         skb_kptr_b = bpf_skb_acquire(skb);
>
> Yeah, this acquire kfunc is the root cause. That basically is a sign
> that 'multiple references are ok' even though that's not the case.
> It would have been the least intrusive way to allow skb kptrs, but it
> wouldn't work well to enforce unique ownership of the skb.
>
> Instead of this, we could teach the verifier that a struct ops
> argument can be an acquired pointer being passed as a parameter to the
> BPF program.
> Then, on entry into the program, it would have a reference state
> created for it and the corresponding ref_obj_id be assigned to that
> when loading it from ctx.
> When the reference state goes away, loading it again from ctx will
> just mark it as an unknown scalar or return an error.

This seems to be the more natural way to enforce such semantics rather
than hacking kfuncs. I will try to incorporate this into the next RFC.
Thank you!

-Amery

>
> Then, you can only push the skb into a list or rbtree once.
>
> >
> >         bpf_list_push_back(&list_1, skb_kptr_a->bpf_list);
> >         bpf_list_push_back(&list_2, skb_kptr_b->bpf_list);
> >         ...
> >
> > Thanks,
> > Amery
> >
> > > In the XDP queueing series, we taught the verifier to have reference
> > > semantics for xdp_md in the dequeue program, and then return such a
> > > pointer from the program back to the kernel.
> > > The changes to allow PTR_TO_PACKET accesses were also fairly simple,
> > > the verifier just needs to know that comparison of data, data_end can
> > > only be done for pkt pointers coming from the same xdp_md (as there
> > > can be multiple in the same program at a time).