Tossing in my $.02:

I anticipate that most users of AF_XDP will want packet processing for a
given RX queue to happen on a single core - otherwise we end up paying
for cross-core cache traffic. The usual model is one thread, one socket,
one core, but this isn't enforced anywhere in the AF_XDP code; setting it
up is left entirely to the user.

On 7 May 2019, at 11:24, Alexei Starovoitov wrote:

> I'm not saying that we shouldn't do busy-poll. I'm saying it's
> complementary, but in all cases single core per af_xdp rx queue
> with user thread pinning is preferred.

So I think we're on the same page here.

> Stack rx queues and af_xdp rx queues should look almost the same from
> napi point of view. Stack -> normal napi in softirq. af_xdp -> new
> kthread to work with both poll and busy-poll. The only difference
> between poll and busy-poll will be the running context: new kthread
> vs user task.

...

> A burst of 64 packets on stack queues or some other work in softirqd
> will spike the latency for af_xdp queues if softirq is shared.

True, but would it be shared? This goes back to the current model,
which, as used by Intel, is:

    (channel == RX, TX, softirq)

MLX, on the other hand, wants:

    (channel == RX.stack, RX.AF_XDP, TX.stack, TX.AF_XDP, softirq)

which would indeed lead to sharing. The more I look at the above, the
less I like it. Perhaps this should be disallowed? I believe there was
some mention at LSF/MM that the 'channel' concept is specific to the HW
and really shouldn't be part of the SW API.

> Hence the proposal for new napi_kthreads:
> - user creates af_xdp socket and binds to _CPU_ X then
> - driver allocates single af_xdp rq queue (queue ID doesn't need to be
>   exposed)
> - spawns kthread pinned to cpu X
> - configures irq for that af_xdp queue to fire on cpu X
> - user space with the help of libbpf pins its processing thread to
>   that cpu X
> - repeat above for as many af_xdp sockets as there are cpus
>   (it's also ok to pick the same cpu X for different af_xdp sockets,
>   then the new kthread is shared)
> - user space configures hw to RSS to this set of af_xdp sockets.
>   since ethtool api is a mess I propose to use af_xdp api to do this
>   rss config

From a high-level point of view this sounds quite sensible, but some
details need ironing out. The proposal above essentially enforces a
model of:

    (af_xdp    = RX.af_xdp + bound_cpu)
    (bound_cpu = hw.cpu + af_xdp.kthread + hw.irq)

(ignoring TX for now)

I foresee two issues with this approach:

  1. hardware limitations in the number of queues/rings
  2. RSS/steering rules

> - user creates af_xdp socket and binds to _CPU_ X then
> - driver allocates single af_xdp rq queue (queue ID doesn't need to be
>   exposed)

Here, the driver may not be able to create an arbitrary RQ; it may need
to tear down or reuse an existing one used by the stack. This may not be
an issue for modern hardware.

> - user space configures hw to RSS to this set of af_xdp sockets.
>   since ethtool api is a mess I propose to use af_xdp api to do this
>   rss config

Currently, RSS only steers default traffic. On a system with shared
stack/af_xdp queues, there needs to be a way to split the traffic types,
unless we're talking about a model where all traffic goes to AF_XDP.
That classification has to be done by the NIC, since it happens before
RSS steering - which today means sending flow match rules to the NIC,
and that is less than ideal. I agree that the ethtool interface is
suboptimal, but it does make it clear to the user what's going on.
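To make the "one thread, one socket, one core" model above concrete,
here's a rough, untested sketch of what the userspace side looks like
today with the xsk_* helpers from libbpf (tools/lib/bpf/xsk.h); the
fill/completion ring handling and error cleanup are elided:

#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
#include <bpf/xsk.h>

#define NUM_FRAMES	4096
#define FRAME_SIZE	XSK_UMEM__DEFAULT_FRAME_SIZE

static struct xsk_umem *umem;
static struct xsk_ring_prod fq, tx;
static struct xsk_ring_cons cq, rx;
static struct xsk_socket *xsk;

/* One AF_XDP socket bound to one hw queue, with the processing thread
 * pinned to the same cpu the queue's irq is steered to. */
static int bind_one_queue(const char *ifname, int queue_id, int cpu)
{
	cpu_set_t set;
	void *bufs;

	/* umem backing the rx/tx descriptors for this one queue */
	if (posix_memalign(&bufs, getpagesize(), NUM_FRAMES * FRAME_SIZE))
		return -1;
	if (xsk_umem__create(&umem, bufs, NUM_FRAMES * FRAME_SIZE,
			     &fq, &cq, NULL))
		return -1;

	/* one socket <-> one queue id (today the queue id is still
	 * visible to userspace) */
	if (xsk_socket__create(&xsk, ifname, queue_id, umem,
			       &rx, &tx, NULL))
		return -1;

	/* pin this thread to cpu X; the irq affinity for the queue has
	 * to be set separately (e.g. via /proc/irq/N/smp_affinity) */
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

Note that the irq affinity and the cpu/queue mapping still have to be
wired up by hand - today all of that bookkeeping falls on the user.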
Perhaps an af_xdp library that does some bookkeeping:

 - open af_xdp socket
 - define af_xdp_set as (classification, steering rules, other?)
 - bind socket to (cpu, af_xdp_set)
 - kernel:
   - pins calling thread to cpu
   - creates kthread if one doesn't exist, binds it to the irq and cpu
   - has driver create RQ.af_xdp, possibly replacing RQ.stack
   - applies (af_xdp_set) to the NIC

Seems workable, but a little complicated? The complexity could be moved
into a separate library. (A rough sketch of what such an API could look
like is at the end of this mail.)

> imo that would be the simplest and most performant way of using af_xdp.
> All configuration apis are under libbpf (or libxdp if we choose to
> fork it)
> End result is one af_xdp rx queue - one napi - one kthread - one user
> thread.
> All pinned to the same cpu with irq on that cpu.
> Both poll and busy-poll approaches will not bounce data between cpus.
> No 'shadow' queues to speak of and should solve the issues that
> folks were bringing up in different threads.

Sounds like a sensible model from my POV.
--
Jonathan
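PS: purely to make the shape of that bookkeeping API concrete, it might
look something like the sketch below - every name here is invented for
illustration, nothing like this exists in libbpf today:

#include <linux/types.h>

struct af_xdp_sock;			/* opaque socket handle */

/* classification + steering description applied to a set of sockets */
struct af_xdp_set {
	__u32	flow_type;		/* classification, e.g. TCP_V4_FLOW */
	__be16	dst_port;		/* example steering rule */
	/* other? */
};

/* open an af_xdp socket on ifname */
int af_xdp_open(struct af_xdp_sock **sk, const char *ifname);

/* bind socket to (cpu, af_xdp_set): the kernel pins the calling thread
 * to cpu, creates/reuses the kthread bound to that cpu and irq, has the
 * driver create RQ.af_xdp (possibly replacing RQ.stack), and applies
 * the set's classification/steering rules to the NIC. */
int af_xdp_bind(struct af_xdp_sock *sk, int cpu,
		const struct af_xdp_set *set);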