Re: Per-queue XDP programs, thoughts

On Mon, 15 Apr 2019 18:32:58 +0200, Jesper Dangaard Brouer wrote:
> On Mon, 15 Apr 2019 13:59:03 +0200 Björn Töpel <bjorn.topel@xxxxxxxxx> wrote:
> > Hi,
> > 
> > As you probably can derive from the amount of time this is taking, I'm
> > not really satisfied with the design of per-queue XDP programs. (That,
> > plus I'm a terribly slow hacker... ;-)) I'll try to expand my thinking
> > in this mail!

Jesper has been advocating per-queue progs since the very early days of
XDP.  If it was easy to implement cleanly, we would've gotten it
already ;)

> > Beware, it's kind of a long post, and it's all over the place.  
> 
> Cc'ing all the XDP-maintainers (and netdev).
> 
> > There are a number of ways of setting up flows in the kernel, e.g.
> > 
> > * Connecting/accepting a TCP socket (in-band)
> > * Using tc-flower (out-of-band)
> > * ethtool (out-of-band)
> > * ...
> > 
> > The first acts on sockets, the second on netdevs. Then there's ethtool
> > to configure RSS, and the RSS-on-steroids rxhash/ntuple that can steer
> > to queues. Most users care about sockets and netdevices. Queues are
> > more of an implementation detail of Rx, or for QoS on the Tx side.
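As a reminder of the mechanics being discussed here: an exact-match ntuple rule wins, otherwise the flow hash indexes the RSS indirection table. A runnable sketch of that precedence, with made-up table size, queue count and rules (not from any real NIC):

```python
# Sketch of ethtool-style Rx steering: an exact-match ntuple rule wins,
# otherwise the flow hash indexes the RSS indirection table.
# Table size, queue count and rules here are illustrative.

def rss_queue(flow_hash, indir_table):
    """Map a flow hash to an Rx queue via the indirection table."""
    return indir_table[flow_hash % len(indir_table)]

# 128-entry indirection table spreading flows over 4 Rx queues.
indir = [i % 4 for i in range(128)]

# ntuple rules: exact (proto, dst-port) match steered to a fixed queue.
ntuple_rules = {("tcp", 80): 2}

def steer(proto, dport, flow_hash):
    """ntuple rules take precedence over RSS."""
    return ntuple_rules.get((proto, dport), rss_queue(flow_hash, indir))

print(steer("tcp", 80, 0xdeadbeef))   # ntuple hit -> queue 2
print(steer("udp", 53, 0xdeadbeef))   # RSS fallback -> queue 3
```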
> 
> Let me first acknowledge that the current Linux tools to administer
> HW filters are lacking (well, suck).  We know the hardware is capable,
> as DPDK has a full API for this called rte_flow[1]. If nothing else,
> you/we can use the DPDK API to create a program that configures the
> hardware; examples here[2]
> 
>  [1] https://doc.dpdk.org/guides/prog_guide/rte_flow.html
>  [2] https://doc.dpdk.org/guides/howto/rte_flow.html
> 
> > XDP is something that we can attach to a netdevice. Again, very
> > natural from a user perspective. As for XDP sockets, the current
> > mechanism is that we attach to an existing netdevice queue. Ideally
> > what we'd like is to *remove* the queue concept. A better approach
> > would be creating the socket and setting it up -- but not binding it
> > to a queue. Instead just bind it to a netdevice (or, crazier, just
> > create a socket without a netdevice).

You can remove the concept of a queue from the AF_XDP ABI (well, extend
it to not require the queue to be explicitly specified..), but you can't
avoid user space knowing there is a queue.  If you do, you can no
longer track and configure that queue (things like IRQ moderation,
descriptor count, etc.)

Currently the term "queue" refers mostly to the queues that the stack
uses, which leaves e.g. the XDP TX queues in a strange grey zone (from
the ethtool channel ABI perspective, RPS, XPS, etc.)  So it would be
nice to have the HW queue ID somewhat detached from the stack queue ID.
Or at least it'd be nice to introduce queue types?  I've been pondering
this for a while; I don't see any silver bullet here..

> Let me just remind everybody that the AF_XDP performance gains come
> from binding the resource, which allows for lock-free semantics, as
> explained here[3].
> 
> [3] https://github.com/xdp-project/xdp-tutorial/tree/master/advanced03-AF_XDP#where-does-af_xdp-performance-come-from
> 
> 
> > The socket is an endpoint, where I'd like data to end up (or get sent
> > from). If the kernel can attach the socket to a hardware queue,
> > there's zerocopy; if not, copy-mode. Ditto for Tx.
> 
> Well, XDP programs per RXQ are just a building block to achieve this.
> 
> As Van Jacobson explains[4], sockets or applications "register" a
> "transport signature", and get back a "channel".   In our case, the
> netdev-global XDP program is our way to register/program these transport
> signatures and redirect (e.g. into the AF_XDP socket).
> This requires some work in software to parse and match transport
> signatures to sockets.  XDP programs per RXQ are a way to get the
> hardware to perform this filtering for us.
> 
>  [4] http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf
> 
> 
> > Does a user (control plane) want/need to care about queues? Just
> > create a flow to a socket (out-of-band or inband) or to a netdevice
> > (out-of-band).  
> 
> A userspace "control-plane" program could hide the setup and use what
> the system/hardware can provide in terms of optimizations.  VJ[4] e.g.
> suggests that the "listen" socket first registers the transport
> signature (with the driver) on "accept()".   If the HW supports the
> DPDK-rte_flow API, we can register a 5-tuple (or create TC-HW rules)
> and load our "transport-signature" XDP prog on the queue number we
> choose.  If not, then our netdev-global XDP prog needs a hash table
> keyed on the 5-tuple, and has to do the 5-tuple parsing itself.
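The software fallback described above could be sketched as follows; this is a runnable userspace illustration of the idea, and all names in it are made up, not a real kernel or libbpf API:

```python
# Userspace sketch of the software fallback: a "transport signature" is
# a 5-tuple registered at accept() time; the netdev-global XDP prog
# would do the equivalent lookup per packet and redirect on a hit.
# All names here are illustrative.

signatures = {}  # 5-tuple -> channel/socket id

def register_signature(saddr, daddr, sport, dport, proto, chan_id):
    """Called on accept(): register the flow's transport signature."""
    signatures[(saddr, daddr, sport, dport, proto)] = chan_id

def classify(pkt):
    """Per-packet path: parse the 5-tuple, look it up, redirect or pass."""
    key = (pkt["saddr"], pkt["daddr"], pkt["sport"], pkt["dport"], pkt["proto"])
    chan = signatures.get(key)
    return ("REDIRECT", chan) if chan is not None else ("PASS", None)

register_signature("10.0.0.1", "10.0.0.2", 40000, 80, "tcp", chan_id=7)
pkt = {"saddr": "10.0.0.1", "daddr": "10.0.0.2",
       "sport": 40000, "dport": 80, "proto": "tcp"}
print(classify(pkt))   # matched flow -> ("REDIRECT", 7)
```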

But we do want the ability to configure the queue, and get stats for
that queue.. so we can't hide the queue completely, right?

> Creating netdevices via HW filters into queues is an interesting idea.
> DPDK has an example here[5] of how to send packets per flow (even via
> ethtool filter setup!) to queues that end up in SR-IOV devices.
> 
>  [5] https://doc.dpdk.org/guides/howto/flow_bifurcation.html

I wish I had the courage to nack the ethtool redirect to VF Intel
added :)

> > Do we envison any other uses for per-queue XDP other than AF_XDP? If
> > not, it would make *more* sense to attach the XDP program to the
> > socket (e.g. if the endpoint would like to use kernel data structures
> > via XDP).  
> 
> As demonstrated in [5] you can use (ethtool) hardware filters to
> redirect packets into VFs (Virtual Functions).
> 
> I also want us to extend XDP to allow for redirect from a PF (Physical
> Function) into a VF (Virtual Function).  First the netdev-global
> XDP-prog needs to support this (maybe extend xdp_rxq_info with PF + VF
> info).  Next, configure a HW filter to a queue# and load an XDP prog on
> that queue# that only "redirects" to a single VF.  Now, if the
> driver+HW supports it, it can "eliminate" the per-queue XDP-prog and do
> everything in HW.

That sounds slightly contrived.  If the program is not doing anything,
why involve XDP at all?  As stated above we already have too many ways
to do flow config and/or VF redirect.

> > If we'd like to slice a netdevice into multiple queues. Isn't macvlan
> > or similar *virtual* netdevices a better path, instead of introducing
> > yet another abstraction?  

Yes, the question of use cases is extremely important.  It seems
Mellanox is working on "spawning devlink ports" IOW slicing a device
into subdevices.  Which is a great way to run bifurcated DPDK/netdev
applications :/  If that gets merged I think we have to recalculate
what purpose AF_XDP is going to serve, if any.

In my view we have different "levels" of slicing:

 (1) full HW device;
 (2) software device (mdev?);
 (3) separate netdev;
 (4) separate "RSS instance";
 (5) dedicated application queues.

1 - is SR-IOV VFs
2 - is software device slicing with mdev (Mellanox)
3 - is (I think) Intel's VSI debugfs... "thing"..
4 - is just ethtool RSS contexts (Solarflare)
5 - is currently AF-XDP (Intel)

(2) or lower is required to have raw register access allowing vfio/DPDK
to run "natively".

(3) or lower allows for full reuse of all networking APIs, with very
natural RSS configuration, TC/QoS configuration on TX etc.

(5) is sufficient for zero copy.

So back to the use case.. seems like AF_XDP is evolving into allowing
"level 3" to pass all frames directly to the application?  With
optional XDP filtering?  It's not a trick question - I'm just trying to
place it somewhere on my mental map :)

> XDP redirect is a more generic abstraction that allows us to implement
> macvlan.  Except the macvlan driver is missing ndo_xdp_xmit.  Again,
> first I write this as a global-netdev XDP-prog that does a lookup in a
> BPF-map.  Next, I configure HW filters that match the MAC-addr into a
> queue# and attach a simpler XDP-prog to that queue#, which redirects
> into the macvlan device.
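The global-prog stage of that two-step setup amounts to a map lookup keyed on destination MAC, choosing the macvlan device to redirect into. A runnable sketch, with made-up MAC addresses and device names:

```python
# Sketch of the macvlan-over-XDP dispatch: the netdev-global prog's
# BPF-map lookup, keyed on destination MAC, picks the target macvlan
# device for the redirect.  Entries here are purely illustrative.

mac_to_dev = {
    "02:00:00:00:00:01": "macvlan0",
    "02:00:00:00:00:02": "macvlan1",
}

def xdp_dispatch(dst_mac):
    """Redirect to the matching macvlan device, or pass to the stack."""
    dev = mac_to_dev.get(dst_mac)
    return ("REDIRECT", dev) if dev else ("PASS", None)

print(xdp_dispatch("02:00:00:00:00:02"))   # -> ("REDIRECT", "macvlan1")
```

A per-queue prog for this setup would be the degenerate case: the HW filter already matched the MAC, so the prog reduces to an unconditional redirect into one device.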
> 
> > Further, is queue/socket a good abstraction for all devices? Wifi? 

Right, queue is no abstraction whatsoever.  Queue is a low-level
primitive.

> > By just viewing sockets as an endpoint, we leave it up to the kernel to
> > figure out the best way. "Here's an endpoint. Give me data **here**."
> > 
> > The OpenFlow protocol does however support the concept of queues per
> > port, but do we want to introduce that into the kernel?

Switch queues != host queues.  Switch/HW queues are for QoS, host queues
are for RSS.  Those two concepts are similar yet different.  In Linux
if you offload basic TX TC (mq)prio (the old work John has done for
Intel) the actual number of HW queues becomes "channel count" x "num TC
prios".  What would queue ID mean for AF_XDP in that setup, I wonder.
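The channel-count arithmetic above can be made concrete. One plausible layout (not any specific driver's) puts each TC's queues in a contiguous block, which is exactly why a bare "queue ID" becomes ambiguous:

```python
# With TX mqprio offloaded, HW queue count is channels x TC prios.
# The block layout below is one plausible convention, not any
# specific driver's mapping.

channels = 8   # "channel count" from ethtool -l
num_tcs = 4    # number of offloaded TC priorities

def hw_queue(tc, channel):
    """HW queue ID when each TC owns a contiguous block of queues."""
    return tc * channels + channel

total = channels * num_tcs
print(total)             # 32 HW queues in total
print(hw_queue(2, 5))    # queue 21: saying "queue 5" alone is ambiguous
```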

> > So, if per-queue XDP programs are only for AF_XDP, I think it's better
> > to stick the program to the socket. For me per-queue is sort of a
> > leaky abstraction...
> >
> > More thoughts. If we go the route of per-queue XDP programs. Would it
> > be better to leave the setup to XDP -- i.e. the XDP program is
> > controlling the per-queue programs (think tail-calls, but a map with
> > per-q programs). Instead of the netlink layer. This is part of a
> > bigger discussion, namely should XDP really implement the control
> > plane?
> >
> > I really like that a software switch/router can be implemented
> > effectively with XDP, but ideally I'd like it to be offloaded by
> > hardware -- using the same control/configuration plane. Can we do it
> > in hardware, do that. If not, emulate via XDP.  

There are already a number of proposals in the "device application
slicing" space; it would be really great if we could make sure we don't
repeat the mistakes of the flow configuration APIs, and try to prevent
having too many of them..

Which is very challenging unless we have strong use cases..

> That is actually the reason I want XDP per-queue, as it is a way to
> offload the filtering to the hardware.  And if the per-queue XDP-prog
> becomes simple enough, the hardware can eliminate it and do everything
> in hardware (hopefully).
> 
> > The control plane should IMO be outside of the XDP program.

ENOCOMPUTE :)  An XDP program is BPF byte code; it's never the control
plane.  Do you mean the application should not control the "context/
channel/subdev" creation?  You're not saying "it's not the XDP program
which should be making the classification", no?  The XDP program
controlling the classification was _the_ reason why we liked AF_XDP :)



