On 2019/05/23 20:25, Toke Høiland-Jørgensen wrote:
> Toshiaki Makita <makita.toshiaki@xxxxxxxxxxxxx> writes:
>
>> This improves XDP_TX performance by about 8%.
>>
>> Here are single core XDP_TX test results. CPU consumptions are taken
>> from "perf report --no-child".
>>
>> - Before:
>>
>>   7.26 Mpps
>>
>>   _raw_spin_lock  7.83%
>>   veth_xdp_xmit  12.23%
>>
>> - After:
>>
>>   7.84 Mpps
>>
>>   _raw_spin_lock  1.17%
>>   veth_xdp_xmit   6.45%
>>
>> Signed-off-by: Toshiaki Makita <makita.toshiaki@xxxxxxxxxxxxx>
>> ---
>>  drivers/net/veth.c | 26 +++++++++++++++++++++++++-
>>  1 file changed, 25 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> index 52110e5..4edc75f 100644
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
>>  	return ret;
>>  }
>>
>> +static void veth_xdp_flush_bq(struct net_device *dev)
>> +{
>> +	struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
>> +	int sent, i, err = 0;
>> +
>> +	sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);
>
> Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So
> you're introducing an additional per-cpu bulk queue, only to avoid lock
> contention around the existing pointer ring. But the pointer ring is
> per-rq, so if you have lock contention, this means you must have
> multiple CPUs servicing the same rq, no?

Yes, that's possible, though not recommended.

> So why not just fix that instead?

The queues are also shared with packets sent from the stack by the peer
device; that's why I needed the lock. I have tried separating the
queues, one for redirect and one for the stack, but the receiver side
got too complicated and it ended up with worse performance.

--
Toshiaki Makita
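[Editor's note: for readers following the thread, below is a minimal sketch of the per-CPU bulking scheme the quoted diff implements: frames produced by XDP_TX are buffered in a small per-CPU array and pushed onto the peer's pointer ring in one veth_xdp_xmit() call, so the ring's spinlock is taken once per batch instead of once per frame. Only the opening lines of veth_xdp_flush_bq() appear in the patch excerpt; the struct layout, VETH_XDP_TX_BULK_SIZE, and the veth_xdp_tx_bq() helper are assumptions filled in for illustration, not the patch's exact code.]

/* Sketch only: reconstruction of the per-CPU XDP_TX bulk queue discussed
 * above. VETH_XDP_TX_BULK_SIZE, struct xdp_tx_bulk_queue, xdp_tx_bq and
 * veth_xdp_tx_bq() are assumed names; only veth_xdp_flush_bq()'s opening
 * lines are shown in the quoted diff.
 */
#include <linux/netdevice.h>
#include <linux/percpu-defs.h>
#include <net/xdp.h>

#define VETH_XDP_TX_BULK_SIZE 16

struct xdp_tx_bulk_queue {
	struct xdp_frame *q[VETH_XDP_TX_BULK_SIZE];
	unsigned int count;
};

static DEFINE_PER_CPU(struct xdp_tx_bulk_queue, xdp_tx_bq);

/* Flush the buffered frames to the peer's ptr_ring in a single
 * veth_xdp_xmit() call; the ring lock is taken once per batch.
 * On failure, return the frames so their pages are not leaked.
 */
static void veth_xdp_flush_bq(struct net_device *dev)
{
	struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
	int sent, i;

	sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);
	if (sent < 0) {
		for (i = 0; i < bq->count; i++)
			xdp_return_frame(bq->q[i]);
	}

	bq->count = 0;
}

/* Queue one XDP_TX frame, flushing first if the batch is full. */
static void veth_xdp_tx_bq(struct net_device *dev, struct xdp_frame *frame)
{
	struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);

	if (unlikely(bq->count == VETH_XDP_TX_BULK_SIZE))
		veth_xdp_flush_bq(dev);

	bq->q[bq->count++] = frame;
}

[This matches the perf numbers above: with frames batched, _raw_spin_lock drops from 7.83% to 1.17% because the per-rq ring lock is contended far less often when multiple CPUs feed the same rq.]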