On Thu, Jul 14, 2022 at 3:46 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>
> Stanislav Fomichev <sdf@xxxxxxxxxx> writes:
>
> > On Wed, Jul 13, 2022 at 2:52 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> >>
> >> Stanislav Fomichev <sdf@xxxxxxxxxx> writes:
> >>
> >> > On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> >> >>
> >> >> Packet forwarding is an important use case for XDP, which offers significant performance improvements compared to forwarding using the regular networking stack. However, XDP currently offers no mechanism to delay, queue or schedule packets, which limits the practical uses for XDP-based forwarding to those where the capacity of input and output links always match each other (i.e., no rate transitions or many-to-one forwarding). It also prevents an XDP-based router from doing any kind of traffic shaping or reordering to enforce policy.
> >> >>
> >> >> This series represents a first RFC of our attempt to remedy this lack. The code in these patches is functional, but needs additional testing and polishing before being considered for merging. I'm posting it here as an RFC to get some early feedback on the API and overall design of the feature.
> >> >>
> >> >> DESIGN
> >> >>
> >> >> The design consists of three components: A new map type for storing XDP frames, a new 'dequeue' program type that will run in the TX softirq to provide the stack with packets to transmit, and a set of helpers to dequeue packets from the map, optionally drop them, and to schedule an interface for transmission.
> >> >>
> >> >> The new map type is modelled on the PIFO data structure proposed in the literature[0][1]. It represents a priority queue where packets can be enqueued at any priority, but are always dequeued from the head. From the XDP side, the map is simply used as a target for the bpf_redirect_map() helper, where the target index is the desired priority.
> >> >
> >> > I have the same question I asked on the series from Cong: Any considerations for existing carousel/edt-like models?
> >>
> >> Well, the reason for the addition in patch 5 (continuously increasing priorities) is exactly to be able to implement EDT-like behaviour, where the priority is used as time units to clock out packets.
> >
> > Ah, ok, I didn't read the patches closely enough. I saw some limits for the ranges and assumed that it wasn't capable of efficiently storing 64-bit timestamps..
>
> The goal is definitely to support full 64-bit priorities. Right now you have to start out at 0 but can go on for a full 64 bits, but that's a bit of an API wart that I'd like to get rid of eventually...
>
> >> > Can we make the map flexible enough to implement different qdisc policies?
> >>
> >> That's one of the things we want to be absolutely sure about. We are starting out with the PIFO map type because the literature makes a good case that it is flexible enough to implement all conceivable policies. The goal of the test harness linked as note [4] is to actually examine this; Frey is our PhD student working on this bit.
> >>
> >> Thus far we haven't hit any limitations on this, but we'll need to add more policies before we are done with this.
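
(To check that I'm reading the EDT part right: on the XDP side I'm picturing something like the sketch below, where the priority passed to bpf_redirect_map() doubles as a departure timestamp. The map-definition details are my guess; only the BPF_MAP_TYPE_PIFO_XDP name and the redirect-with-priority-as-index part are from the series, and I'm ignoring the "have to start out at 0" wart you mention.)

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Sketch only: the exact fields a PIFO map definition needs are guessed. */
struct {
        __uint(type, BPF_MAP_TYPE_PIFO_XDP);
        __uint(max_entries, 4096);
} pifo SEC(".maps");

SEC("xdp")
int enqueue_edt(struct xdp_md *ctx)
{
        /* Priority doubles as a departure time: clock this packet out
         * roughly 100us from now (the pacing policy here is made up).
         */
        __u64 departure = bpf_ktime_get_ns() + 100 * 1000;

        /* The PIFO is just a bpf_redirect_map() target; the index is the
         * priority. The series' new helper for scheduling the interface
         * for transmission would also be called somewhere around here,
         * not shown since I don't have its name in front of me.
         */
        return bpf_redirect_map(&pifo, departure, 0);
}

char _license[] SEC("license") = "GPL";
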
> >> Another consideration is performance, of course, so we're also planning to do a comparison with a more traditional "bunch of FIFO queues" type data structure for at least a subset of the algorithms. Kartikeya also had an idea for an alternative way to implement a priority queue using (semi-)lockless skiplists, which may turn out to perform better.
> >>
> >> If there's any particular policy/algorithm you'd like to see included in this evaluation, please do let us know, BTW! :)
> >
> > I honestly am not sure what the bar for accepting this should be. But on Cong's series I mentioned Martin's CC bpf work as a great example of what we should be trying to do for qdisc-like maps. Having a bpf version of fq/fq_codel/whatever_other_complex_qdisc might be very convincing :-)
>
> Just doing flow queueing is quite straightforward with PIFOs. We're working on fq_codel. Personally I also want to implement something that has feature parity with sch_cake (which includes every feature and the kitchen sink already) :)

Yeah, sch_cake works too 👍

> >> >> The dequeue program type is a new BPF program type that is attached to an interface; when an interface is scheduled for transmission, the stack will execute the attached dequeue program and, if it returns a packet to transmit, that packet will be transmitted using the existing ndo_xdp_xmit() driver function.
> >> >>
> >> >> The dequeue program can obtain packets by pulling them out of a PIFO map using the new bpf_packet_dequeue() helper. This returns a pointer to an xdp_md structure, which can be dereferenced to obtain packet data and data_meta pointers like in an XDP program. The returned packets are also reference counted, meaning the verifier enforces that the dequeue program either drops the packet (with the bpf_packet_drop() helper), or returns it for transmission. Finally, a helper is added that can be used to actually schedule an interface for transmission using the dequeue program type; this helper can be called from both XDP and dequeue programs.
> >> >>
> >> >> PERFORMANCE
> >> >>
> >> >> Preliminary performance tests indicate about 50 ns of per-packet overhead from adding queueing to the xdp_fwd example (last patch), which translates to a 20% PPS overhead (but still 2x the forwarding performance of the netstack):
> >> >>
> >> >> xdp_fwd   : 4.7 Mpps (213 ns/pkt)
> >> >> xdp_fwd -Q: 3.8 Mpps (263 ns/pkt)
> >> >> netstack  :   2 Mpps (500 ns/pkt)
> >> >>
> >> >> RELATION TO BPF QDISC
> >> >>
> >> >> Cong Wang's BPF qdisc patches[2] share some aspects of this series, in particular the use of a map to store packets. This is no accident, as we've had ongoing discussions for a while now. I have no great hope that we can completely converge the two efforts into a single BPF-based queueing API (as has been discussed before[3], consolidating the SKB and XDP paths is challenging). Rather, I'm hoping that we can converge the designs enough that we can share BPF code between XDP and qdisc layers using common functions, like it's possible to do with XDP and TC-BPF today. This would imply agreeing on the map type and API, and possibly on the set of helpers available to the BPF programs.
> >> >
> >> > What would be the big difference for the map wrt xdp_frame vs sk_buff, excluding all the obvious stuff like locking/refcnt?
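
(Side note, going back to the flow queueing point above: to check that I understand how per-flow scheduling maps onto a single PIFO, I'm imagining the classic virtual-time scheme from the PIFO paper, roughly as sketched below. The per-flow state, the crude flow key and all the names here are mine, not from the series.)

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hand-wavy STFQ-style flow queueing on top of one PIFO; only the
 * "bpf_redirect_map() into a PIFO with the rank as index" part is from
 * the series, everything else is made up for illustration.
 */
struct flow_state {
        __u64 finish;   /* virtual finish time of the flow's last packet */
};

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 16384);
        __type(key, __u32);
        __type(value, struct flow_state);
} flows SEC(".maps");

struct {
        __uint(type, BPF_MAP_TYPE_PIFO_XDP);
        __uint(max_entries, 4096);
} fq_pifo SEC(".maps");

__u64 virt_time; /* would be advanced by the dequeue side in a real impl */

SEC("xdp")
int enqueue_fq(struct xdp_md *ctx)
{
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct flow_state *fs, init = {};
        __u64 len = ctx->data_end - ctx->data;
        __u32 key = 0;

        /* Crude stand-in for a real 5-tuple hash: fold in the IPv4 daddr. */
        if (data + 34 > data_end)
                return XDP_PASS;
        __builtin_memcpy(&key, data + 30, sizeof(key));

        fs = bpf_map_lookup_elem(&flows, &key);
        if (!fs) {
                bpf_map_update_elem(&flows, &key, &init, BPF_NOEXIST);
                fs = bpf_map_lookup_elem(&flows, &key);
                if (!fs)
                        return XDP_DROP;
        }

        /* Rank = the packet's virtual start time: new/idle flows jump in
         * at the current virtual time, backlogged flows queue up behind
         * their own previous packets.
         */
        __u64 rank = fs->finish > virt_time ? fs->finish : virt_time;
        fs->finish = rank + len;

        return bpf_redirect_map(&fq_pifo, rank, 0);
}
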
> >> I expect it would be quite straightforward to just add a second subtype of the PIFO map in this series that holds skbs. In fact, I think that from the BPF side, the whole model implemented here would be possible to carry over to the qdisc layer more or less wholesale. Some other features of the qdisc layer, like locking, classes, and multi-CPU/multi-queue management may be trickier, but I'm not sure how much of that we should expose in a BPF qdisc anyway (as you may have noticed I commented on Cong's series to this effect regarding the classful qdiscs).
> >
> > Maybe a related question here: with the way you do BPF_MAP_TYPE_PIFO_GENERIC vs BPF_MAP_TYPE_PIFO_XDP, how hard would it be to have support for storing xdp_frames/skbs in any map? Let's say we have a generic BPF_MAP_TYPE_RBTREE, where the key is priority/timestamp/whatever; can we, based on the value's btf_id, figure out the rest? (That the value is a kernel structure and needs special care and more constraints: can't be looked up from user space and so on.)
> >
> > Seems like we really need to have two special cases: where we transfer ownership of the xdp_frame/skb to/from the map. Any other big complications?
> >
> > That way we can maybe untangle the series a bit: we can talk about efficient data structures for storing frames/skbs independently of some generic support for storing them in maps. Any major complications with that approach?
>
> I've had discussions with Kartikeya on this already (based on his 'kptr in map' work). That may well end up being feasible, which would be fantastic. The reason we didn't use it for this series is that there's still some work to do on the generic verifier/infrastructure support side of this (the PIFO map is the oldest part of this series), and I didn't want to hold up the rest of the queueing work until that landed.

Yes, exactly, kptr seems like a very promising thing that you can leverage. I'm looking forward to it!

> Now that we have a functional prototype I expect that iterating on the data structure will be the next step. One complication with XDP is that we probably want to keep using XDP_REDIRECT to place packets into the map, because that gets us bulking, which is important for performance; however, in general I like the idea of using BTF to designate the map value type, and if we can figure out a way to make it completely generic even for packets I'm all for that! :)

As long as we have generic kptr-based, skb-capable maps and can add/remove/lookup skbs using existing helpers, it seems fine to have XDP_REDIRECT as some kind of xdp-specific optimization.

> >> >> PATCH STRUCTURE
> >> >>
> >> >> This series consists of a total of 17 patches, as follows:
> >> >>
> >> >> Patches 1-3 are smaller preparatory refactoring patches used by subsequent patches.
> >> >
> >> > Seems like these can go separately without holding the rest?
> >>
> >> Yeah, guess so? They don't really provide much benefit without the users later in the series, though, so not sure there's much point in sending them separately?
> >>
> >> >> Patches 4-5 introduce the PIFO map type, and patch 6 introduces the dequeue program type.
> >> >
> >> > [...]
> >> >
> >> >> Patches 7-10 add the dequeue helpers and the verifier features needed to recognise packet pointers, reference count them, and allow dereferencing them to obtain packet data pointers.
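
(For context on where my question below was coming from: as I understand it, the dequeue side ends up looking roughly like the sketch below. The section name, context type and helper signatures are my guesses; only the overall flow of dequeue, inspect, then either drop or return for transmission is from the cover letter.)

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Same PIFO map as in the enqueue sketch further up the thread. */
struct {
        __uint(type, BPF_MAP_TYPE_PIFO_XDP);
        __uint(max_entries, 4096);
} pifo SEC(".maps");

struct dequeue_ctx; /* placeholder name for whatever the dequeue ctx type is */

/* bpf_packet_dequeue()/bpf_packet_drop() are the new helpers from the
 * series; their exact signatures are guessed here.
 */
SEC("dequeue")
void *dequeue_prog(struct dequeue_ctx *ctx)
{
        struct xdp_md *pkt = bpf_packet_dequeue(ctx, &pifo, 0);
        void *data, *data_end;

        if (!pkt)
                return NULL;            /* queue empty, nothing to transmit */

        /* data/data_end can be dereferenced like in an XDP program. */
        data = (void *)(long)pkt->data;
        data_end = (void *)(long)pkt->data_end;
        if (data + 14 > data_end) {     /* e.g. drop anything shorter than an Ethernet header */
                bpf_packet_drop(ctx, pkt);  /* releases the reference */
                return NULL;
        }

        /* Returning the packet hands it to ndo_xdp_xmit(); the verifier
         * makes sure we either do this or call bpf_packet_drop().
         */
        return pkt;
}
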
> >> > Have you considered using kfuncs for these instead of introducing new hooks/contexts/etc?
> >>
> >> I did, but I'm not sure it's such a good fit? In particular, the way the direct packet access is implemented for dequeue programs (where you can get an xdp_md pointer and deref that to get data and data_end pointers) is done this way so programs can share utility functions between XDP and dequeue programs. And having a new program type for the dequeue progs seems like the obvious thing to do since they're doing something new?
> >>
> >> Maybe I'm missing something, though; could you elaborate on how you'd use kfuncs instead?
> >
> > I was thinking about the approach in general. In networking bpf, we've been adding new program types, new contexts and new explicit hooks. This all requires a ton of boilerplate (converting from the uapi ctx to the kernel one, exposing hook points, etc, etc). And looking at Benjamin's HID series, it's so much more elegant: there is no uapi, just a kernel function that can be overridden and a bunch of kfuncs exposed. No uapi, no helpers, no fake contexts.
> >
> > For networking and xdp the ship might have sailed, but I was wondering whether we should still be stuck in that 'old' boilerplate world or whether we have a chance to use the new shiny things :-)
> >
> > (But it might all be moot if we'd like to have stable uapis?)
>
> Right, I see what you mean. My immediate feeling is that having an explicit stable UAPI for XDP has served us well. We do all kinds of rewrite tricks behind the scenes (things like switching between xdp_buff and xdp_frame, bulking, direct packet access, reading ifindexes by pointer-walking txq->dev, etc) which are important ways to improve performance without exposing too many nitty-gritty details in the API.
>
> There's also consistency to consider: I think the addition of queueing should work as a natural extension of the existing programming model for XDP. So I feel like this is more a case of "if we were starting from scratch today we might do things differently (like the HID series), but when extending things let's keep it consistent"?

Agreed. If we want to have the ability to inspect/change the xdp/skb in your new dequeue hooks, we might need to keep that fake xdp_buff for consistency :-(
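
Btw, to make the kptr idea from above a bit more concrete: what I'm picturing is that the packet just becomes a kptr field in an otherwise ordinary map value, something along these lines. This is purely hypothetical; it assumes xdp_frame/sk_buff could be allowed as kptr types (they aren't today) and hand-waves the ordered map type:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct xdp_frame; /* opaque here */

struct pkt_slot {
        __u64 enqueue_ts;                /* or whatever per-packet metadata */
        struct xdp_frame __kptr *frame;  /* the map owns the frame while stored */
};

struct {
        __uint(type, BPF_MAP_TYPE_HASH); /* stand-in; an rbtree-style map would
                                          * give ordered dequeue by key */
        __uint(max_entries, 4096);
        __type(key, __u64);              /* priority / departure timestamp */
        __type(value, struct pkt_slot);
} queue SEC(".maps");

/* Moving a packet in and out of the map would then just be an ownership
 * transfer with bpf_kptr_xchg():
 *
 *     old = bpf_kptr_xchg(&slot->frame, frame);   // enqueue: frame -> map
 *     frame = bpf_kptr_xchg(&slot->frame, NULL);  // dequeue: map -> frame
 *
 * with the verifier tracking that whatever comes back out has to either be
 * transmitted or dropped, same as with the refcounted xdp_md pointers in
 * the current series.
 */

That would keep the choice of data structure (PIFO vs rbtree vs something else) orthogonal to how packet ownership is handled, which is basically the untangling I was hoping for.
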