Toshiaki Makita <makita.toshiaki@xxxxxxxxxxxxx> writes:

> This improves XDP_TX performance by about 8%.
>
> Here are single core XDP_TX test results. CPU consumptions are taken
> from "perf report --no-child".
>
> - Before:
>
>   7.26 Mpps
>
>   _raw_spin_lock  7.83%
>   veth_xdp_xmit  12.23%
>
> - After:
>
>   7.84 Mpps
>
>   _raw_spin_lock  1.17%
>   veth_xdp_xmit   6.45%
>
> Signed-off-by: Toshiaki Makita <makita.toshiaki@xxxxxxxxxxxxx>
> ---
>  drivers/net/veth.c | 26 +++++++++++++++++++++++++-
>  1 file changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 52110e5..4edc75f 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
>  	return ret;
>  }
>
> +static void veth_xdp_flush_bq(struct net_device *dev)
> +{
> +	struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
> +	int sent, i, err = 0;
> +
> +	sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);

Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So
you're introducing an additional per-cpu bulk queue, only to avoid lock
contention around the existing pointer ring. But the pointer ring is
per-rq, so if you have lock contention, this means you must have
multiple CPUs servicing the same rq, no? So why not just fix that
instead?

-Toke