On 19/05/23 (木) 21:18:25, Toke Høiland-Jørgensen wrote:
Toshiaki Makita <makita.toshiaki@xxxxxxxxxxxxx> writes:
On 2019/05/23 20:25, Toke Høiland-Jørgensen wrote:
Toshiaki Makita <makita.toshiaki@xxxxxxxxxxxxx> writes:
This improves XDP_TX performance by about 8%.
Here are single core XDP_TX test results. CPU consumptions are taken
from "perf report --no-child".
- Before:
7.26 Mpps
_raw_spin_lock 7.83%
veth_xdp_xmit 12.23%
- After:
7.84 Mpps
_raw_spin_lock 1.17%
veth_xdp_xmit 6.45%
Signed-off-by: Toshiaki Makita <makita.toshiaki@xxxxxxxxxxxxx>
---
drivers/net/veth.c | 26 +++++++++++++++++++++++++-
1 file changed, 25 insertions(+), 1 deletion(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 52110e5..4edc75f 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
return ret;
}
+static void veth_xdp_flush_bq(struct net_device *dev)
+{
+ struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
+ int sent, i, err = 0;
+
+ sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);
Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So
you're introducing an additional per-cpu bulk queue, only to avoid lock
contention around the existing pointer ring. But the pointer ring is
per-rq, so if you have lock contention, this means you must have
multiple CPUs servicing the same rq, no?
Yes, it's possible. Not recommended though.
So why not just fix that instead?
The queues are shared with packets from stack sent from peer. That's
because I needed the lock. I have tried to separate the queues, one for
redirect and one for stack, but receiver side got too complicated and it
ended up with worse performance.
I meant fix it with configuration. Now many receive queues are you
running on the veth device in your benchmarks, and how have you
configured the RPS?
As I wrote this test is a single queue test and does not have any
contention.
Per packet lock has some overhead even in that configuration.
Toshiaki Makita