Cong Wang <xiyou.wangcong@xxxxxxxxx> writes:
> From: Cong Wang <cong.wang@xxxxxxxxxxxxx>
>
> This *incomplete* patchset introduces a programmable Qdisc with eBPF.

Sorry for the delay in looking at this; a few comments below:

[...]

> The goal here is to make this Qdisc as programmable as possible, that
> is, to replace as many existing Qdisc's as we can, no matter in tree
> or out of tree. This is why I give up on PIFO which has serious
> limitations on the programmablity.

Could you elaborate on the limitations of the PIFO? It looks to me like
the SKB map you're proposing is very close to a PIFO (it's a priority
queue that SKBs can be inserted into at an arbitrary position)?

> Here is a summary of design decisions I made:
>
> 1. Avoid eBPF struct_ops, as it would be really hard to program a
>    Qdisc with this approach, literally all the struct Qdisc_ops and
>    struct Qdisc_class_ops are needed to implement. This is almost as
>    hard as programming a Qdisc kernel module.

I agree that implementing the full Qdisc_class_ops as BPF functions is
going to be prohibitive; let's use this opportunity to define a simpler
API!

> 2. Introduce skb map, which will allow other eBPF programs to store
>    skb's too.
>
>    a) As eBPF maps are not directly visible to the kernel, we have to
>       dump the stats via eBPF map API's instead of netlink.

Why not do a hybrid thing where the kernel side of the qdisc keeps some
basic stats (packet counters for the number of packets queued/dropped,
for instance) and define a separate BPF callback that can return more
stats to the kernel for translation into netlink (e.g., "how many
packets are currently in the queue")? And then if a particular
implementation wants to do more custom stats, it can use BPF APIs for
that?

>    b) The user-space is not allowed to read the entire packets, only
>       __sk_buff itself is readable, because we don't have such a use
>       case yet and it would require a different API to read the data,
>       as map values have fixed length.

I agree there's not any need to make the packet contents directly
accessible to userspace via reading the map.

>    c) Two eBPF helpers are introduced for skb map operations:
>       bpf_skb_map_push() and bpf_skb_map_pop(). Normal map update is
>       not allowed.

So with kptr support in the map this could conceivably be done via the
existing map_push() etc. helpers?

>    d) Multi-queue support is implemented via map-in-map, in a similar
>       push/pop fasion.

Not sure I understand what you mean by this? Will the qdisc
automatically attach itself to all HWQs of an interface (like sch_mq
does)? Surely it will be up to the BPF program whether it'll use
map-in-map or something else?

>    e) Use the netdevice notifier to reset the packets inside skb map
>       upon NETDEV_DOWN event.

Is it really necessary to pull out SKBs from under the BPF program? If
the qdisc is removed and the BPF program goes away, so will the map and
all the SKBs stored in it?

> 3. Integrate with existing TC infra. For example, if the user doesn't
>    want to implement her own filters (e.g. a flow dissector), she
>    should be able to re-use the existing TC filters. Another helper
>    bpf_skb_tc_classify() is introduced for this purpose.

This seems a bit convoluted? If a BPF program wants to reuse an
existing BPF TC filter it can just do that at the code level. As for
the in-kernel filters, are there really any of them worth reusing from
a BPF qdisc other than the flow dissector? In which case, why not just
expose that instead of taking all the TC filter infrastructure with
you?
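To illustrate what I mean by "at the code level": the classification
logic can live in a shared function that gets compiled into both a
regular TC filter and the qdisc's enqueue program. Rough, untested
sketch (the "sched_qdisc/enqueue" section name is made up, and the
hash-based band selection is just an example):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Shared classification logic, compiled into both programs below. */
static __always_inline int classify(struct __sk_buff *skb)
{
	/* e.g., hash the flow into one of four bands */
	return skb->hash & 3;
}

SEC("tc")	/* existing TC filter attach point */
int flow_filter(struct __sk_buff *skb)
{
	return classify(skb);
}

SEC("sched_qdisc/enqueue")	/* made-up section name for the new type */
int enqueue_prog(struct __sk_buff *skb)
{
	int band = classify(skb);

	/* ...push skb into the skb map for this band... */
	return band;
}

char _license[] SEC("license") = "GPL";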
Bringing in some text from patch 4 as well to comment on it all in one
go:

> Introduce a new Qdisc which is completely managed by eBPF program of
> type BPF_PROG_TYPE_SCHED_QDISC. It accepts two eBPF programs of the
> same type, but one for enqueue and the other for dequeue.
>
> And it interacts with Qdisc layer in two ways:
> 1) It relies on Qdisc watchdog to handle throttling;

Having a way for BPF to schedule the qdisc watchdog wakeup is probably
needed. For the XDP queueing side I'm planning to use BPF timers for
(the equivalent of) this, but since we already have a watchdog
mechanism on the qdisc side, maybe just exposing that is fine...

> 2) It could pass the skb enqueue/dequeue down to child classes

I'm not sure it's a good idea to keep the qdisc hierarchy. If someone
wants to build a classification hierarchy they could just do that
directly in BPF by whichever mechanism they want? I.e., a program can
just instantiate multiple skb maps to do classful queueing and select
the right one however it wants?

This keeps the qdisc itself simple; basically:

enqueue: pass skb to BPF program to do with as it will (enqueue into a
         map, drop, etc)
dequeue: execute BPF program, transmit pkt if it returns one
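I.e., something like this (completely untested; the map type name and
section names are made up, and I'm guessing at the signatures of the
bpf_skb_map_push()/bpf_skb_map_pop() helpers from your patchset):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* A single FIFO is enough for the sketch; BPF_MAP_TYPE_SKB_MAP is a
 * made-up type name. */
struct {
	__uint(type, BPF_MAP_TYPE_SKB_MAP);
	__uint(max_entries, 1024);
} fifo SEC(".maps");

SEC("sched_qdisc/enqueue")
int enqueue_prog(struct __sk_buff *skb)
{
	/* A real scheduler would compute a rank here; a constant rank
	 * gives plain FIFO behaviour. Non-zero return means the kernel
	 * drops the skb. */
	return bpf_skb_map_push(&fifo, skb, 0);
}

SEC("sched_qdisc/dequeue")
int dequeue_prog(struct __sk_buff *ctx)
{
	/* Pop the lowest-ranked skb; exactly how it gets handed back
	 * to the kernel for transmission is the part the API needs to
	 * define. */
	return bpf_skb_map_pop(&fifo);
}

char _license[] SEC("license") = "GPL";

-Toke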