Re: [PATCH bpf-next 3/3] veth: Support bulk XDP_TX

Jason Wang <jasowang@xxxxxxxxxx> · Fri, 24 May 2019 11:54:00 +0800

On 2019/5/24 上午11:28, Toshiaki Makita wrote:

On 2019/05/24 12:13, Jason Wang wrote:

On 2019/5/23 下午9:51, Toshiaki Makita wrote:

On 19/05/23 (木) 22:29:27, Jesper Dangaard Brouer wrote:

On Thu, 23 May 2019 20:35:50 +0900
Toshiaki Makita <makita.toshiaki@xxxxxxxxxxxxx> wrote:


On 2019/05/23 20:25, Toke Høiland-Jørgensen wrote:

Toshiaki Makita <makita.toshiaki@xxxxxxxxxxxxx> writes:

This improves XDP_TX performance by about 8%.

Here are single core XDP_TX test results. CPU consumptions are taken
from "perf report --no-child".

- Before:

    7.26 Mpps

    _raw_spin_lock  7.83%
    veth_xdp_xmit  12.23%

- After:

    7.84 Mpps

    _raw_spin_lock  1.17%
    veth_xdp_xmit   6.45%

Signed-off-by: Toshiaki Makita <makita.toshiaki@xxxxxxxxxxxxx>
---
   drivers/net/veth.c | 26 +++++++++++++++++++++++++-
   1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 52110e5..4edc75f 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device
*dev, int n,
       return ret;
   }
   +static void veth_xdp_flush_bq(struct net_device *dev)
+{
+    struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
+    int sent, i, err = 0;
+
+    sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);

Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So
you're introducing an additional per-cpu bulk queue, only to avoid
lock
contention around the existing pointer ring. But the pointer ring is
per-rq, so if you have lock contention, this means you must have
multiple CPUs servicing the same rq, no?

Yes, it's possible. Not recommended though.


I think the general per-cpu TX bulk queue is overkill.  There is a loop
over packets in veth_xdp_rcv(struct veth_rq *rq, budget, *status), and
the caller veth_poll() will call veth_xdp_flush(rq->dev).

Why can't you store this "temp" bulk array in struct veth_rq ?

Of course I can. But I thought tun has the same problem and we can
decrease memory footprint by sharing the same storage between devices.


For TUN and for its fast path where vhost passes a bulk of XDP frames
(through msg_control) to us, we probably just need a temporary bulk
array in tun_xdp_one() instead of a global one. I can post patch or
maybe you if you're interested in this.

Of course you/I can. What I'm concerned is that could be waste of cache
line when softirq runs veth napi handler and then tun napi handler.




Well, technically the bulk queue passed to TUN could be reused. I admit 

it may save cacheline in ideal case but I wonder how much we could gain 

on real workload. (Note TUN doesn't use napi handler to do XDP, it has a 

NAPI mode but it was mainly used for hardening and XDP was not 

implemented there, maybe we should fix this).


Thanks



Thanks



Or if other devices want to reduce queues so that we can use XDP on
many-cpu servers and introduce locks, we can use this storage for that
case as well.

Still do you prefer veth-specific solution?


You could even alloc/create it on the stack of veth_poll() and send it
along via a pointer to veth_xdp_rcv).


Toshiaki Makita