On Tue, 16 Apr 2019 at 18:53, Jonathan Lemon <jonathan.lemon@xxxxxxxxx> wrote:
>
> On 16 Apr 2019, at 6:55, Jesper Dangaard Brouer wrote:
>
> > On Mon, 15 Apr 2019 15:49:32 -0700 Jakub Kicinski <jakub.kicinski@xxxxxxxxxxxxx> wrote:
> >
> >> On Mon, 15 Apr 2019 18:32:58 +0200, Jesper Dangaard Brouer wrote:
> >>> On Mon, 15 Apr 2019 13:59:03 +0200 Björn Töpel <bjorn.topel@xxxxxxxxx> wrote:
> >>>> Hi,
> >>>>
> >>>> As you probably can derive from the amount of time this is taking, I'm not really satisfied with the design of per-queue XDP programs. (That, plus I'm a terribly slow hacker... ;-)) I'll try to expand my thinking in this mail!
> >>
> >> Jesper was advocating per-queue progs since the very early days of XDP. If it was easy to implement cleanly we would've already gotten it ;)
> >
> > (I cannot help but feel offended here... IMHO that statement is BS, that is not how upstream development works, and sure, I am to blame as I've simply been too lazy or busy with other stuff to implement it. It is not that hard to send down a queue# together with the XDP attach command.)
> >
> > I've been advocating for per-queue progs from day-1, since this is an obvious performance advantage, given the programmer can specialize the BPF/XDP-prog to the filtered traffic. I hope/assume we are on the same page here, that per-queue progs are a performance optimization.
> >
> > I guess the rest of the discussion in this thread is (1) whether we can convince each other that someone will actually use this optimization, and (2) whether we can abstract this away from the user.
> >
> >>>> Beware, it's kind of a long post, and it's all over the place.
> >>>
> >>> Cc'ing all the XDP-maintainers (and netdev).
> >>>
> >>>> There are a number of ways of setting up flows in the kernel, e.g.
> >>>>
> >>>> * Connecting/accepting a TCP socket (in-band)
> >>>> * Using tc-flower (out-of-band)
> >>>> * ethtool (out-of-band)
> >>>> * ...
> >>>>
> >>>> The first acts on sockets, the second on netdevs. Then there's ethtool to configure RSS, and the RSS-on-steroids rxhash/ntuple that can steer to queues. Most users care about sockets and netdevices. Queues are more of an implementation detail of Rx, or for QoS on the Tx side.
> >>>
> >>> Let me first acknowledge that the current Linux tools to administer HW filters are lacking (well, they suck). We know the hardware is capable, as DPDK has a full API for this called rte_flow [1]. If nothing else, you/we can use the DPDK API to create a program that configures the hardware; examples here [2].
> >>>
> >>> [1] https://doc.dpdk.org/guides/prog_guide/rte_flow.html
> >>> [2] https://doc.dpdk.org/guides/howto/rte_flow.html
> >>>
> >>>> XDP is something that we can attach to a netdevice. Again, very natural from a user perspective. As for XDP sockets, the current mechanism is that we attach to an existing netdevice queue. Ideally what we'd like is to *remove* the queue concept. A better approach would be creating the socket and setting it up -- but not binding it to a queue. Instead just bind it to a netdevice (or, crazier, just create a socket without a netdevice).
> >>
> >> You can remove the concept of a queue from the AF_XDP ABI (well, extend it to not require the queue being explicitly specified..), but you can't avoid the user space knowing there is a queue.
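(For reference, this is roughly where the queue id enters today's AF_XDP bind ABI. A minimal sketch only -- a real program must register a UMEM and create the fill/completion and RX/TX rings via setsockopt() before bind() will succeed, and the function name below is invented:

/* Sketch: bind an AF_XDP socket to <ifname, queue_id>.
 * Incomplete on purpose -- UMEM registration and ring setup omitted.
 */
#include <linux/if_xdp.h>
#include <net/if.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef AF_XDP
#define AF_XDP 44	/* PF_XDP; older libcs may not define it */
#endif

int xsk_bind_to_queue(const char *ifname, unsigned int queue_id)
{
	struct sockaddr_xdp sxdp = {
		.sxdp_family   = AF_XDP,
		.sxdp_ifindex  = if_nametoindex(ifname),
		.sxdp_queue_id = queue_id,	/* the explicit queue# under discussion */
		.sxdp_flags    = XDP_COPY,	/* or XDP_ZEROCOPY where supported */
	};
	int fd = socket(AF_XDP, SOCK_RAW, 0);

	if (fd < 0)
		return -1;
	if (bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

Extending the ABI as suggested would presumably mean making sxdp_queue_id optional rather than removing it.)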
> >> Because if you do, you can no longer track and configure that queue (things like IRQ moderation, descriptor count etc.)
> >
> > Yes, exactly. Bjørn, you mentioned leaky abstractions, and by removing the concept of a queue# from the AF_XDP ABI you have basically created a leaky abstraction, as the sysadmin would need to tune/configure the "hidden" abstracted queue# (IRQ moderation, descriptor count etc.).
> >
> >> Currently the term "queue" refers mostly to the queues that the stack uses. Which leaves e.g. the XDP TX queues in a strange grey zone (from the ethtool channel ABI perspective, RPS, XPS etc.) So it would be nice to have the HW queue ID somewhat detached from the stack queue ID. Or at least it'd be nice to introduce queue types? I've been pondering this for a while, I don't see any silver bullet here..
> >
> > Yes! - I also worry about the term "queue". This is very interesting to discuss.
> >
> > I do find it very natural that your HW (e.g. Netronome) has several HW RX-queues that feed/send to a single software NAPI RX-queue. (I assume this is how your HW already works, but software/Linux cannot know this internal HW queue id). How we expose and use this is interesting.
> >
> > I do want to be able to create new RX-queues, semi-dynamically at "setup"/load time. But still a limited number of RX-queues, for performance and memory reasons (TX-queues are cheaper). Memory, because we prealloc memory per RX-queue (and give it to HW). Performance, because with too many queues there is less chance to have a (RX) bulk of packets in a queue.
>
> How would these be identified? Suppose there's a set of existing RX queues for the device which handle the normal system traffic - then I add an AF_XDP socket which gets its own dedicated RX queue. Does this create a new queue id for the device? Create a new namespace with its own queue id?
>
> The entire reason the user even cares about the queue id at all is because it needs to use ethtool/netlink/tc for configuration, or the net device's XDP program needs to differentiate between the queues for specific treatment.
>

Exactly! I've been thinking along these lines as well -- I'd like to go towards an "AF_XDP with dedicated queues" model (in addition to the attach one). Then again, as Jesper and Jakub reminded me, XDP Tx is yet another inaccessible (from a configuration standpoint) set of queues.

Maybe there is a need for proper "queues". Some are attached to the kernel stack, some to XDP Tx and some to AF_XDP sockets.

That said, I like the good old netdevice and socket model for user applications. "There's a bunch of pipes. Some have HW backing, but I don't care much." :-P

Björn

> > For example, I would not create an RX-queue per TCP-flow. But why do I still want per-queue XDP-progs and HW-filters for this TCP-flow use-case... let me explain:
> >
> > E.g. I want to implement an XDP TCP socket load-balancer (same-host delivery, between XDP and the network stack). And my goal is to avoid touching packet payload on the XDP RX-CPU. First I configure an ethtool filter to redirect all TCP port 80 traffic to a specific RX-queue (could also be N queues); then I don't need to parse for TCP port 80 in my per-queue BPF-prog, and I have a higher chance of bulk-RX. Next I need the HW to provide some flow-identifier, e.g. RSS-hash, flow-id or internal HW-queue-id, which I can use to redirect on (e.g. via CPUMAP to N CPUs).
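(As an aside, the per-queue program in that scenario could be close to trivial. A sketch under assumptions -- map name and target CPU are invented, and since XDP today has no portable way to read the HW RSS-hash/flow-id Jesper asks for, this simply moves the whole queue's traffic to one worker CPU via CPUMAP:

/* Per-queue sketch: the HW filter already guarantees only TCP port 80
 * lands on this queue, so no parsing here -- just hand the frames to a
 * remote CPU through a CPUMAP. Per-flow fan-out across N CPUs would need
 * the HW-provided flow identifier discussed above.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define TARGET_CPU 2	/* hypothetical worker CPU */

struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(max_entries, 8);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));	/* per-CPU queue size, set from user space */
} cpu_map SEC(".maps");

SEC("xdp")
int xdp_port80_queue_prog(struct xdp_md *ctx)
{
	return bpf_redirect_map(&cpu_map, TARGET_CPU, 0);
}

char _license[] SEC("license") = "GPL";
)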
> > This way I don't touch packet payload on the RX-CPU (my bench shows one RX-CPU can handle between 14-20 Mpps).
> >
> >>> Let me just remind everybody that the AF_XDP performance gains come from binding the resource, which allows for lock-free semantics, as explained here [3].
> >>>
> >>> [3] https://github.com/xdp-project/xdp-tutorial/tree/master/advanced03-AF_XDP#where-does-af_xdp-performance-come-from
> >>>
> >>>> The socket is an endpoint, where I'd like data to end up (or get sent from). If the kernel can attach the socket to a hardware queue, there's zerocopy; if not, copy-mode. Ditto for Tx.
> >>>
> >>> Well, XDP programs per RXQ are just a building block to achieve this.
> >>>
> >>> As Van Jacobson explains [4], sockets or applications "register" a "transport signature", and get back a "channel". In our case, the netdev-global XDP program is our way to register/program these transport signatures and redirect (e.g. into the AF_XDP socket). This requires some work in software to parse and match transport signatures to sockets. The XDP programs per RXQ are a way to get hardware to perform this filtering for us.
> >>>
> >>> [4] http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf
> >>>
> >>>> Does a user (control plane) want/need to care about queues? Just create a flow to a socket (out-of-band or in-band) or to a netdevice (out-of-band).
> >>>
> >>> A userspace "control-plane" program could hide the setup and use whatever optimizations the system/hardware can provide. VJ [4] e.g. suggests that the "listen" socket first registers the transport signature (with the driver) on "accept()". If the HW supports the DPDK rte_flow API, we can register a 5-tuple (or create TC-HW rules) and load our "transport-signature" XDP prog on the queue number we choose. If not, then our netdev-global XDP prog needs a hash-table with 5-tuples and has to do 5-tuple parsing.
> >>
> >> But we do want the ability to configure the queue, and get stats for that queue.. so we can't hide the queue completely, right?
> >
> > Yes, that is yet another example of the queue id "leaking".
> >
> >>> Creating netdevices via HW filters into queues is an interesting idea. DPDK has an example here [5] of how to, per flow (via ethtool filter setup even!), send packets to queues that end up in SRIOV devices.
> >>>
> >>> [5] https://doc.dpdk.org/guides/howto/flow_bifurcation.html
> >>
> >> I wish I had the courage to nack the ethtool redirect to VF Intel added :)
> >>
> >>>> Do we envision any other uses for per-queue XDP other than AF_XDP? If not, it would make *more* sense to attach the XDP program to the socket (e.g. if the endpoint would like to use kernel data structures via XDP).
> >>>
> >>> As demonstrated in [5] you can use (ethtool) hardware filters to redirect packets into VFs (Virtual Functions).
> >>>
> >>> I also want us to extend XDP to allow for redirect from a PF (Physical Function) into a VF (Virtual Function). First the netdev-global XDP-prog needs to support this (maybe extend xdp_rxq_info with PF + VF info). Next, configure a HW filter to a queue# and load an XDP prog on that queue# that only "redirects" to a single VF. Now if the driver+HW supports it, it can "eliminate" the per-queue XDP-prog and do everything in HW.
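(To make the software-fallback path above concrete, here is a rough sketch of a netdev-global program doing the 5-tuple "transport signature" match and redirecting into an AF_XDP socket through an XSKMAP. Map layout and names are invented for illustration, IPv4 options are ignored for brevity, and with a HW filter per queue the per-queue variant could drop the parsing entirely:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct flow_key {
	__u32 saddr, daddr;
	__u16 sport, dport;
};

/* "Transport signatures" registered by the control plane: 5-tuple -> xsks index */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, struct flow_key);
	__type(value, __u32);
} signatures SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_XSKMAP);
	__uint(max_entries, 64);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));	/* AF_XDP socket fds, set from user space */
} xsks SEC(".maps");

SEC("xdp")
int xdp_global_prog(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;
	struct iphdr *iph;
	struct tcphdr *tcp;
	struct flow_key key = {};
	__u32 *idx;

	if ((void *)(eth + 1) > data_end)
		return XDP_PASS;
	if (eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_PASS;

	iph = (void *)(eth + 1);
	if ((void *)(iph + 1) > data_end || iph->protocol != IPPROTO_TCP)
		return XDP_PASS;

	tcp = (void *)(iph + 1);	/* assumes no IP options */
	if ((void *)(tcp + 1) > data_end)
		return XDP_PASS;

	key.saddr = iph->saddr;
	key.daddr = iph->daddr;
	key.sport = tcp->source;
	key.dport = tcp->dest;

	idx = bpf_map_lookup_elem(&signatures, &key);
	if (!idx)
		return XDP_PASS;	/* not ours: normal stack delivery */

	return bpf_redirect_map(&xsks, *idx, 0);
}

char _license[] SEC("license") = "GPL";
)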
> >> That sounds slightly contrived. If the program is not doing anything, why involve XDP at all?
> >
> > If the HW doesn't support this, then the XDP software will do the work. If the HW supports this, then you can still list the XDP-prog via bpftool, and see that you have an XDP prog that does this action (and maybe expose an offloaded-to-HW bit if you like to expose this info).
> >
> >> As stated above we already have too many ways to do flow config and/or VF redirect.
> >>
> >>>> If we'd like to slice a netdevice into multiple queues, isn't macvlan or a similar *virtual* netdevice a better path, instead of introducing yet another abstraction?
> >>
> >> Yes, the question of use cases is extremely important. It seems Mellanox is working on "spawning devlink ports", IOW slicing a device into subdevices. Which is a great way to run bifurcated DPDK/netdev applications :/ If that gets merged I think we have to recalculate what purpose AF_XDP is going to serve, if any.
> >>
> >> In my view we have different "levels" of slicing:
> >
> > I do appreciate this overview of NIC slicing, as HW-filters + per-queue-XDP can be seen as a way to slice up the NIC.
> >
> >> (1) full HW device;
> >> (2) software device (mdev?);
> >> (3) separate netdev;
> >> (4) separate "RSS instance";
> >> (5) dedicated application queues.
> >>
> >> 1 - is SR-IOV VFs
> >> 2 - is software device slicing with mdev (Mellanox)
> >> 3 - is (I think) Intel's VSI debugfs... "thing"..
> >> 4 - is just ethtool RSS contexts (Solarflare)
> >> 5 - is currently AF_XDP (Intel)
> >>
> >> (2) or lower is required to have raw register access allowing vfio/DPDK to run "natively".
> >>
> >> (3) or lower allows for full reuse of all networking APIs, with very natural RSS configuration, TC/QoS configuration on TX etc.
> >>
> >> (5) is sufficient for zero copy.
> >>
> >> So back to the use case.. it seems like AF_XDP is evolving into allowing "level 3" to pass all frames directly to the application? With optional XDP filtering? It's not a trick question - I'm just trying to place it somewhere on my mental map :)
> >
> >>> XDP redirect is a more generic abstraction that allows us to implement macvlan. Except the macvlan driver is missing ndo_xdp_xmit. Again, first I write this as a global-netdev XDP-prog that does a lookup in a BPF-map. Next I configure HW filters that match the MAC-addr into a queue#, and attach a simpler XDP-prog to that queue# which redirects into the macvlan device.
> >>>
> >>>> Further, is queue/socket a good abstraction for all devices? Wifi?
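(For the "macvlan via XDP redirect" idea Jesper describes above, the netdev-global half could look roughly like the following. Names are invented, and since macvlan lacks ndo_xdp_xmit the redirect targets are assumed to be XDP-capable netdevs:

/* Sketch: look up the destination MAC and redirect to the matching
 * netdevice through a DEVMAP. The per-queue variant would skip the
 * lookup once a HW filter steers one MAC per queue.
 */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 256);
	__uint(key_size, ETH_ALEN);		/* destination MAC */
	__uint(value_size, sizeof(__u32));	/* index into tx_ports */
} mac_table SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP);
	__uint(max_entries, 256);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));	/* target ifindex, set from user space */
} tx_ports SEC(".maps");

SEC("xdp")
int xdp_mac_switch(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;
	__u8 dmac[ETH_ALEN];
	__u32 *port;

	if ((void *)(eth + 1) > data_end)
		return XDP_PASS;

	__builtin_memcpy(dmac, eth->h_dest, ETH_ALEN);
	port = bpf_map_lookup_elem(&mac_table, dmac);
	if (!port)
		return XDP_PASS;	/* unknown MAC: let the stack handle it */

	return bpf_redirect_map(&tx_ports, *port, 0);
}

char _license[] SEC("license") = "GPL";
)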
> >> Switch queues != host queues. Switch/HW queues are for QoS, host queues are for RSS. Those two concepts are similar yet different. In Linux, if you offload basic TX TC (mq)prio (the old work John has done for Intel), the actual number of HW queues becomes "channel count" x "num TC prios". What would queue ID mean for AF_XDP in that setup, I wonder.
> >
> > Thanks for explaining that. I must admit I never really understood the mqprio concept and these "prios" (when reading the code and playing with it).
> >
> >>>> So, if per-queue XDP programs are only for AF_XDP, I think it's better to stick the program to the socket. For me per-queue is sort of a leaky abstraction...
> >>>>
> >>>> More thoughts. If we go the route of per-queue XDP programs, would it be better to leave the setup to XDP -- i.e. the XDP program controlling the per-queue programs (think tail-calls, but a map with per-q programs) -- instead of the netlink layer? This is part of a bigger discussion, namely should XDP really implement the control plane?
> >>>>
> >>>> I really like that a software switch/router can be implemented effectively with XDP, but ideally I'd like it to be offloaded by hardware -- using the same control/configuration plane. If we can do it in hardware, do that. If not, emulate via XDP.
> >>
> >> There are already a number of proposals in the "device application slicing" space; it would be really great if we could make sure we don't repeat the mistakes of the flow configuration APIs, and try to prevent having too many of them..
> >>
> >> Which is very challenging unless we have strong use cases..
> >>
> >>> That is actually the reason I want XDP per-queue, as it is a way to offload the filtering to the hardware. And if the per-queue XDP-prog becomes simple enough, the hardware can eliminate it and do everything in hardware (hopefully).
> >>>
> >>>> The control plane should IMO be outside of the XDP program.
> >>
> >> ENOCOMPUTE :) An XDP program is BPF byte code; it's never the control plane. Do you mean the application should not control the "context/channel/subdev" creation? You're not saying "it's not the XDP program which should be making the classification", no? The XDP program controlling the classification was _the_ reason why we liked AF_XDP :)
> >
> > --
> > Best regards,
> >   Jesper Dangaard Brouer
> >   MSc.CS, Principal Kernel Engineer at Red Hat
> >   LinkedIn: http://www.linkedin.com/in/brouer
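P.S. As a footnote to the "map with per-q programs" idea above: the software-only dispatch could be as small as the sketch below (names invented, no driver/HW offload involved -- whether a driver could recognize and offload such a construct is exactly the open question in this thread).

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_QUEUES 64

/* Per-queue programs installed by user space (or by the control plane). */
struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, MAX_QUEUES);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));	/* prog fds, filled from user space */
} per_queue_progs SEC(".maps");

SEC("xdp")
int xdp_dispatch(struct xdp_md *ctx)
{
	/* Jump to the program installed for this RX queue, if any. */
	bpf_tail_call(ctx, &per_queue_progs, ctx->rx_queue_index);

	/* No per-queue program installed: default behaviour. */
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";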