On Wed, Jul 13, 2022 at 2:52 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>
> Stanislav Fomichev <sdf@xxxxxxxxxx> writes:
>
> > On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> >>
> >> Packet forwarding is an important use case for XDP, which offers
> >> significant performance improvements compared to forwarding using the
> >> regular networking stack. However, XDP currently offers no mechanism to
> >> delay, queue or schedule packets, which limits the practical uses for
> >> XDP-based forwarding to those where the capacity of input and output links
> >> always match each other (i.e., no rate transitions or many-to-one
> >> forwarding). It also prevents an XDP-based router from doing any kind of
> >> traffic shaping or reordering to enforce policy.
> >>
> >> This series represents a first RFC of our attempt to remedy this lack. The
> >> code in these patches is functional, but needs additional testing and
> >> polishing before being considered for merging. I'm posting it here as an
> >> RFC to get some early feedback on the API and overall design of the
> >> feature.
> >>
> >> DESIGN
> >>
> >> The design consists of three components: a new map type for storing XDP
> >> frames, a new 'dequeue' program type that will run in the TX softirq to
> >> provide the stack with packets to transmit, and a set of helpers to dequeue
> >> packets from the map, optionally drop them, and to schedule an interface
> >> for transmission.
> >>
> >> The new map type is modelled on the PIFO data structure proposed in the
> >> literature[0][1]. It represents a priority queue where packets can be
> >> enqueued at any priority, but are always dequeued from the head. From the
> >> XDP side, the map is simply used as a target for the bpf_redirect_map()
> >> helper, where the target index is the desired priority.
>
> > I have the same question I asked on the series from Cong:
> > Any considerations for existing carousel/edt-like models?
>
> Well, the reason for the addition in patch 5 (continuously increasing
> priorities) is exactly to be able to implement EDT-like behaviour, where
> the priority is used as time units to clock out packets.

Ah, ok, I didn't read the patches closely enough. I saw some limits for
the ranges and assumed that it wasn't capable of efficiently storing
64-bit timestamps.
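So just to check that I'm reading the enqueue model right: for an
EDT-style policy, the XDP side would look roughly like the sketch below?
The map definition (sizes, fields) and the exact width of the priority
index are my guesses from the cover letter, so treat this as pseudo-ish
code rather than something I've built against the series:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Sketch only: the map type name is taken from this series, the rest of
 * the definition is a guess.
 */
struct {
	__uint(type, BPF_MAP_TYPE_PIFO_XDP);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
	__uint(max_entries, 4096);
} pifo SEC(".maps");

SEC("xdp")
int xdp_edt_enqueue(struct xdp_md *ctx)
{
	/* Departure time as the priority; a real shaper would add a
	 * per-flow pacing delay here instead of sending "now".
	 */
	__u64 tstamp = bpf_ktime_get_ns();

	/* Per the cover letter, the PIFO is just a bpf_redirect_map()
	 * target, with the index used as the priority.
	 */
	return bpf_redirect_map(&pifo, tstamp, 0);
}

char _license[] SEC("license") = "GPL";

If the priority index is wide enough to carry a nanosecond timestamp
(which I guess is what patch 5 is about), then that covers the
carousel/EDT case I was asking about.
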
> > Can we make the map flexible enough to implement different qdisc
> > policies?
>
> That's one of the things we want to be absolutely sure about. We are
> starting out with the PIFO map type because the literature makes a good
> case that it is flexible enough to implement all conceivable policies.
> The goal of the test harness linked as note [4] is to actually examine
> this; Frey is our PhD student working on this bit.
>
> Thus far we haven't hit any limitations on this, but we'll need to add
> more policies before we are done with this. Another consideration is
> performance, of course, so we're also planning to do a comparison with a
> more traditional "bunch of FIFO queues" type data structure for at least
> a subset of the algorithms. Kartikeya also had an idea for an
> alternative way to implement a priority queue using (semi-)lockless
> skiplists, which may turn out to perform better.
>
> If there's any particular policy/algorithm you'd like to see included in
> this evaluation, please do let us know, BTW! :)

I honestly am not sure what the bar for accepting this should be. But on
Cong's series I mentioned Martin's CC bpf work as a great example of what
we should be trying to do for qdisc-like maps. Having a bpf version of
fq/fq_codel/whatever_other_complex_qdisc might be very convincing :-)

> >> The dequeue program type is a new BPF program type that is attached to an
> >> interface; when an interface is scheduled for transmission, the stack will
> >> execute the attached dequeue program and, if it returns a packet to
> >> transmit, that packet will be transmitted using the existing ndo_xdp_xmit()
> >> driver function.
> >>
> >> The dequeue program can obtain packets by pulling them out of a PIFO map
> >> using the new bpf_packet_dequeue() helper. This returns a pointer to an
> >> xdp_md structure, which can be dereferenced to obtain packet data and
> >> data_meta pointers like in an XDP program. The returned packets are also
> >> reference counted, meaning the verifier enforces that the dequeue program
> >> either drops the packet (with the bpf_packet_drop() helper), or returns it
> >> for transmission. Finally, a helper is added that can be used to actually
> >> schedule an interface for transmission using the dequeue program type; this
> >> helper can be called from both XDP and dequeue programs.
> >>
> >> PERFORMANCE
> >>
> >> Preliminary performance tests indicate about 50ns overhead of adding
> >> queueing to the xdp_fwd example (last patch), which translates to a 20% PPS
> >> overhead (but still 2x the forwarding performance of the netstack):
> >>
> >>   xdp_fwd   : 4.7 Mpps (213 ns/pkt)
> >>   xdp_fwd -Q: 3.8 Mpps (263 ns/pkt)
> >>   netstack  : 2   Mpps (500 ns/pkt)
> >>
> >> RELATION TO BPF QDISC
> >>
> >> Cong Wang's BPF qdisc patches[2] share some aspects of this series, in
> >> particular the use of a map to store packets. This is no accident, as we've
> >> had ongoing discussions for a while now. I have no great hope that we can
> >> completely converge the two efforts into a single BPF-based queueing
> >> API (as has been discussed before[3], consolidating the SKB and XDP paths
> >> is challenging). Rather, I'm hoping that we can converge the designs enough
> >> that we can share BPF code between XDP and qdisc layers using common
> >> functions, like it's possible to do with XDP and TC-BPF today. This would
> >> imply agreeing on the map type and API, and possibly on the set of helpers
> >> available to the BPF programs.
>
> > What would be the big difference for the map wrt xdp_frame vs sk_buff,
> > excluding all the obvious stuff like locking/refcnt?
>
> I expect it would be quite straightforward to just add a second subtype
> of the PIFO map in this series that holds skbs. In fact, I think that,
> from the BPF side, the whole model implemented here would be possible to
> carry over to the qdisc layer more or less wholesale. Some other
> features of the qdisc layer, like locking, classes, and
> multi-CPU/multi-queue management may be trickier, but I'm not sure how
> much of that we should expose in a BPF qdisc anyway (as you may have
> noticed, I commented on Cong's series to this effect regarding the
> classful qdiscs).
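And to check my reading of the dequeue side described above: a minimal
dequeue program would look roughly like the following?
bpf_packet_dequeue()/bpf_packet_drop() are the helper names from your
cover letter, but the context type, argument order, section name and
return convention are all my guesses, so again a sketch rather than
something I've compiled against the series:

/* Sketch only: pull the head packet out of the PIFO (reusing the map
 * from the enqueue sketch above) and hand it back for ndo_xdp_xmit().
 */
SEC("dequeue")                               /* section name is a guess */
void *dequeue_prog(struct dequeue_ctx *ctx)  /* ctx type is a guess */
{
	struct xdp_md *pkt = (void *)bpf_packet_dequeue(ctx, &pifo, 0);

	if (!pkt)
		return NULL;	/* queue empty, nothing to transmit */

	/* pkt->data / pkt->data_end can be dereferenced like in XDP; the
	 * verifier makes us either return the packet for transmission or
	 * drop it explicitly with bpf_packet_drop(ctx, pkt).
	 */
	return pkt;
}

If that's roughly right, the programming model itself looks nice and
small from the BPF side.
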
Maybe a related question here: with the way you do BPF_MAP_TYPE_PIFO_GENERIC
vs BPF_MAP_TYPE_PIFO_XDP, how hard would it be to have support for storing
xdp_frames/skbs in any map? Let's say we have a generic BPF_MAP_TYPE_RBTREE,
where the key is priority/timestamp/whatever; can we, based on the value's
btf_id, figure out the rest (that the value is a kernel structure which needs
special care and more constraints - can't be looked up from user space and so
on)? Seems like we really only need to special-case two things: transferring
ownership of the xdp_frame/skb into the map and back out of it. Any other big
complications? That way we can maybe untangle the series a bit: we can talk
about efficient data structures for storing frames/skbs independently of the
generic support for storing them in maps. Any major complications with that
approach?

> >> PATCH STRUCTURE
> >>
> >> This series consists of a total of 17 patches, as follows:
> >>
> >> Patches 1-3 are smaller preparatory refactoring patches used by subsequent
> >> patches.
>
> > Seems like these can go separately without holding the rest?
>
> Yeah, guess so? They don't really provide much benefit without the users
> later in the series, though, so not sure there's much point in sending
> them separately?
>
> >> Patches 4-5 introduce the PIFO map type, and patch 6 introduces the dequeue
> >> program type.
> > [...]
> >
> >> Patches 7-10 add the dequeue helpers and the verifier features needed to
> >> recognise packet pointers, reference count them, and allow dereferencing
> >> them to obtain packet data pointers.
>
> > Have you considered using kfuncs for these instead of introducing new
> > hooks/contexts/etc?
>
> I did, but I'm not sure it's such a good fit? In particular, the way the
> direct packet access is implemented for dequeue programs (where you can
> get an xdp_md pointer and deref that to get data and data_end pointers)
> is done this way so programs can share utility functions between XDP and
> dequeue programs. And having a new program type for the dequeue progs
> seems like the obvious thing to do since they're doing something new?
>
> Maybe I'm missing something, though; could you elaborate on how you'd
> use kfuncs instead?

I was thinking about the approach in general. In networking bpf, we've
been adding new program types, new contexts and new explicit hooks. This
all requires a ton of boilerplate (converting from the uapi ctx to the
kernel one, exposing hook points, etc, etc). And looking at Benjamin's
HID series, it's so much more elegant: there is no uapi, just a kernel
function that can be overridden and a bunch of kfuncs exposed to the
program. No uapi, no helpers, no fake contexts. For networking and xdp
the ship might have sailed, but I was wondering whether we should still
be stuck in that 'old' boilerplate world or whether we have a chance to
use the nice new shiny things :-)

(But it might all be moot if we'd like to have stable UAPIs?)
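To make that a bit more concrete, the difference I'm picturing is roughly
the one below: the packet operations become kfuncs that the program
declares and calls directly, instead of new UAPI helpers with fixed
protos. Every name here is made up, and I'm keeping your dequeue program
type just for the sake of the example (with the HID-style approach the
hook itself would also just be an overridable kernel function rather than
a new program type):

/* Illustration only, all names below are made up. The point is the
 * shape: no new UAPI helper definitions, the program just declares the
 * kernel functions it wants to call.
 */
extern struct xdp_md *bpf_pifo_dequeue(void *map, __u64 flags) __ksym;
extern void bpf_pifo_drop(struct xdp_md *pkt) __ksym;

SEC("dequeue")
void *dequeue_kfunc_flavour(struct dequeue_ctx *ctx)
{
	struct xdp_md *pkt = bpf_pifo_dequeue(&pifo, 0);

	if (!pkt)
		return NULL;

	return pkt;	/* or bpf_pifo_drop(pkt) and return NULL */
}

The BPF side looks almost the same either way; the difference is all the
plumbing on the kernel side (uapi helper definitions, ctx conversion)
that the kfunc flavour avoids.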