Re: [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc

sdf@xxxxxxxxxx · Fri, 24 Jun 2022 13:51:40 -0700

On 06/01, Cong Wang wrote:
From: Cong Wang <cong.wang@xxxxxxxxxxxxx>

This *incomplete* patchset introduces a programmable Qdisc with eBPF.

There are a few use cases:

1. Allow customizing Qdiscs in an easier way, so that people don't
    have to write a complete Qdisc kernel module just to experiment
    with some new queuing theory.

2. Solve EDT's problem. EDT calculates the "tokens" in clsact, which
    runs before enqueue, so it is impossible to adjust those "tokens"
    after packets get dropped in enqueue. With an eBPF Qdisc, this is
    easily solved with a map shared between clsact and sch_bpf.

3. Replace qevents, as now the user gains much more control over the
    skb and queues.

4. Provide a new way to reuse TC filters. Currently TC relies on filter
    chain and block to reuse the TC filters, but they are too complicated
    to understand. With eBPF helper bpf_skb_tc_classify(), we can invoke
    TC filters on _any_ Qdisc (even on a different netdev) to do the
    classification.

5. Potentially pave a way for ingress to queue packets, although
    current implementation is still only for egress.

6. Possibly pave a way for handling TCP protocol in TC, as rbtree itself
    is already used by TCP to handle TCP retransmission.
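
To make use case 2 concrete, here is a minimal user-space sketch of the
accounting involved. It is not code from the patchset: struct edt_state,
edt_charge() and edt_refund() are hypothetical names, and the struct
stands in for a per-flow entry in the map shared between clsact and
sch_bpf. It assumes a simple rate-based EDT scheme in which each packet
pushes the flow's next departure time forward by len / rate.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical per-flow state. In the design described above this would
 * be a value in an eBPF map visible to both clsact and sch_bpf. */
struct edt_state {
	uint64_t rate_bytes_per_sec;	/* must be non-zero */
	uint64_t next_tstamp_ns;	/* earliest departure of the next packet */
};

/* Time that a packet of 'len' bytes occupies at the configured rate. */
static uint64_t edt_delay_ns(const struct edt_state *s, uint32_t len)
{
	assert(s->rate_bytes_per_sec > 0);
	return (uint64_t)len * NSEC_PER_SEC / s->rate_bytes_per_sec;
}

/* clsact side: pick a departure time for the packet and charge the flow. */
static uint64_t edt_charge(struct edt_state *s, uint64_t now_ns, uint32_t len)
{
	uint64_t tstamp = s->next_tstamp_ns > now_ns ? s->next_tstamp_ns : now_ns;

	s->next_tstamp_ns = tstamp + edt_delay_ns(s, len);
	return tstamp;
}

/* Qdisc side: the packet was dropped at enqueue, so give its time back.
 * Packets that were already stamped keep their timestamps. */
static void edt_refund(struct edt_state *s, uint32_t len)
{
	uint64_t d = edt_delay_ns(s, len);

	s->next_tstamp_ns = s->next_tstamp_ns > d ? s->next_tstamp_ns - d : 0;
}
```

Without the shared state only edt_charge() can run, so a flow whose
packets are dropped at enqueue keeps paying for bytes that never left
the host.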

The goal here is to make this Qdisc as programmable as possible,
that is, to replace as many existing Qdiscs as we can, whether
in tree or out of tree. This is why I gave up on PIFO, which has
serious limitations on programmability.

Here is a summary of design decisions I made:

1. Avoid eBPF struct_ops, as it would be really hard to program
    a Qdisc with this approach: literally all of struct Qdisc_ops
    and struct Qdisc_class_ops would need to be implemented. This is
    almost as hard as programming a Qdisc kernel module.

2. Introduce skb map, which will allow other eBPF programs to store skb's
    too.

    a) As eBPF maps are not directly visible to the kernel, we have to
    dump the stats via eBPF map API's instead of netlink.

    b) User space is not allowed to read entire packets; only __sk_buff
    itself is readable, because we don't have such a use case yet and it
    would require a different API to read the data, as map values have
    fixed length.

    c) Two eBPF helpers are introduced for skb map operations:
    bpf_skb_map_push() and bpf_skb_map_pop(). Normal map update is
    not allowed.

    d) Multi-queue support is implemented via map-in-map, in a similar
    push/pop fashion.

    e) Use the netdevice notifier to reset the packets inside skb map upon
    NETDEV_DOWN event.

3. Integrate with existing TC infra. For example, if the user doesn't want
    to implement her own filters (e.g. a flow dissector), she should be
    able to re-use the existing TC filters. Another helper,
    bpf_skb_tc_classify(), is introduced for this purpose.
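
The push/pop semantics in 2c can be pictured with a small user-space
model. Everything below is an assumption for illustration: this message
does not show the signatures of bpf_skb_map_push() and
bpf_skb_map_pop(), so skb_map_push(), skb_map_pop(), struct skb_map and
struct fake_skb are hypothetical stand-ins. The model assumes packets
are ordered by a 64-bit rank and that pop returns the lowest rank first;
a fixed-size binary min-heap is used here only to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKB_MAP_CAP 64

struct fake_skb { uint32_t len; };	/* stand-in for a real skb */

struct skb_map_entry { uint64_t rank; struct fake_skb *skb; };

struct skb_map {
	size_t count;
	struct skb_map_entry heap[SKB_MAP_CAP];	/* binary min-heap on rank */
};

static void entry_swap(struct skb_map_entry *a, struct skb_map_entry *b)
{
	struct skb_map_entry tmp = *a;

	*a = *b;
	*b = tmp;
}

/* Store skb under 'rank'. Returns 0, or -1 when full (caller drops). */
static int skb_map_push(struct skb_map *m, struct fake_skb *skb, uint64_t rank)
{
	size_t i;

	assert(skb != NULL);
	if (m->count == SKB_MAP_CAP)
		return -1;
	i = m->count++;
	m->heap[i].rank = rank;
	m->heap[i].skb = skb;
	while (i > 0) {			/* sift up */
		size_t parent = (i - 1) / 2;

		if (m->heap[parent].rank <= m->heap[i].rank)
			break;
		entry_swap(&m->heap[parent], &m->heap[i]);
		i = parent;
	}
	return 0;
}

/* Remove and return the skb with the lowest rank, or NULL when empty.
 * Equal ranks come out in no particular order in this model. */
static struct fake_skb *skb_map_pop(struct skb_map *m)
{
	struct fake_skb *skb;
	size_t i = 0;

	if (m->count == 0)
		return NULL;
	skb = m->heap[0].skb;
	m->heap[0] = m->heap[--m->count];
	for (;;) {			/* sift down */
		size_t l = 2 * i + 1, r = l + 1, min = i;

		if (l < m->count && m->heap[l].rank < m->heap[min].rank)
			min = l;
		if (r < m->count && m->heap[r].rank < m->heap[min].rank)
			min = r;
		if (min == i)
			break;
		entry_swap(&m->heap[i], &m->heap[min]);
		i = min;
	}
	return skb;
}
```

In this model an enqueue program would call skb_map_push() with, say, a
departure timestamp as the rank, and a dequeue program would call
skb_map_pop() to get the next packet to transmit.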

Any high-level feedback is welcome. Please kindly do not review any coding
details until RFC tag is removed.

TODO:
1. actually test it

Can you try to implement some existing qdisc using your new mechanism?
For BPF-CC, Martin showcased how dctcp/cubic can be reimplemented;
I feel like this patch series (even as an RFC) should also have a good
example to show that a bpf qdisc is on par and can be used to at least
implement existing policies. fq/fq_codel/cake are good candidates.