Toshiaki Makita <makita.toshiaki@xxxxxxxxxxxxx> writes:

> On 2019/05/23 20:25, Toke Høiland-Jørgensen wrote:
>> Toshiaki Makita <makita.toshiaki@xxxxxxxxxxxxx> writes:
>>
>>> This improves XDP_TX performance by about 8%.
>>>
>>> Here are single core XDP_TX test results. CPU consumptions are taken
>>> from "perf report --no-child".
>>>
>>> - Before:
>>>
>>>   7.26 Mpps
>>>
>>>   _raw_spin_lock  7.83%
>>>   veth_xdp_xmit  12.23%
>>>
>>> - After:
>>>
>>>   7.84 Mpps
>>>
>>>   _raw_spin_lock  1.17%
>>>   veth_xdp_xmit   6.45%
>>>
>>> Signed-off-by: Toshiaki Makita <makita.toshiaki@xxxxxxxxxxxxx>
>>> ---
>>>  drivers/net/veth.c | 26 +++++++++++++++++++++++++-
>>>  1 file changed, 25 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>> index 52110e5..4edc75f 100644
>>> --- a/drivers/net/veth.c
>>> +++ b/drivers/net/veth.c
>>> @@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
>>>  	return ret;
>>>  }
>>>
>>> +static void veth_xdp_flush_bq(struct net_device *dev)
>>> +{
>>> +	struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
>>> +	int sent, i, err = 0;
>>> +
>>> +	sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);
>>
>> Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So
>> you're introducing an additional per-cpu bulk queue, only to avoid lock
>> contention around the existing pointer ring. But the pointer ring is
>> per-rq, so if you have lock contention, this means you must have
>> multiple CPUs servicing the same rq, no?
>
> Yes, it's possible. Not recommended though.
>
>> So why not just fix that instead?
>
> The queues are shared with packets sent from the peer's stack; that is
> why I needed the lock. I have tried to separate the queues, one for
> redirect and one for the stack, but the receiver side got too
> complicated and it ended up with worse performance.

I meant fix it with configuration.
How many receive queues are you running on the veth device in your
benchmarks, and how have you configured RPS?

-Toke