Lorenzo Bianconi <lorenzo@xxxxxxxxxx> writes:
Convert mlx5 driver to xdp_return_frame_bulk APIs.
XDP_REDIRECT (upstream codepath): 8.5Mpps
XDP_REDIRECT (upstream codepath + bulking APIs): 10.1Mpps
Signed-off-by: Lorenzo Bianconi <lorenzo@xxxxxxxxxx>
---
drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index ae90d533a350..5fdfbf390d5c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -369,8 +369,10 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
bool recycle)
{
struct mlx5e_xdp_info_fifo *xdpi_fifo = &sq->db.xdpi_fifo;
+ struct xdp_frame_bulk bq;
u16 i;
+ bq.xa = NULL;
for (i = 0; i < wi->num_pkts; i++) {
struct mlx5e_xdp_info xdpi = mlx5e_xdpi_fifo_pop(xdpi_fifo);
@@ -379,7 +381,7 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
/* XDP_TX from the XSK RQ and XDP_REDIRECT */
dma_unmap_single(sq->pdev, xdpi.frame.dma_addr,
		 xdpi.frame.xdpf->len, DMA_TO_DEVICE);
- xdp_return_frame(xdpi.frame.xdpf);
+ xdp_return_frame_bulk(xdpi.frame.xdpf, &bq);
break;
case MLX5E_XDP_XMIT_MODE_PAGE:
/* XDP_TX from the regular RQ */
@@ -393,6 +395,7 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
WARN_ON_ONCE(true);
}
}
+ xdp_flush_frame_bulk(&bq);
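Putting the three hunks together, the freeing path after this patch would look roughly like the sketch below. This is my paraphrase of the diff, not the exact upstream code: the wi parameter type is guessed from context, and the PAGE/XSK cases and surrounding completion handling are elided.

/* Rough consolidation of mlx5e_free_xdpsq_desc() after this patch;
 * only the FRAME case is shown, other cases elided.
 */
static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
				  struct mlx5e_xdp_wqe_info *wi,
				  bool recycle)
{
	struct mlx5e_xdp_info_fifo *xdpi_fifo = &sq->db.xdpi_fifo;
	struct xdp_frame_bulk bq;
	u16 i;

	bq.xa = NULL;		/* start with an empty bulk queue */

	for (i = 0; i < wi->num_pkts; i++) {
		struct mlx5e_xdp_info xdpi = mlx5e_xdpi_fifo_pop(xdpi_fifo);

		switch (xdpi.mode) {
		case MLX5E_XDP_XMIT_MODE_FRAME:
			/* XDP_TX from the XSK RQ and XDP_REDIRECT */
			dma_unmap_single(sq->pdev, xdpi.frame.dma_addr,
					 xdpi.frame.xdpf->len, DMA_TO_DEVICE);
			/* stage the frame; pages are released in batches */
			xdp_return_frame_bulk(xdpi.frame.xdpf, &bq);
			break;
		/* MLX5E_XDP_XMIT_MODE_PAGE / _XSK cases elided */
		default:
			WARN_ON_ONCE(true);
		}
	}

	/* release whatever is still queued in bq */
	xdp_flush_frame_bulk(&bq);
}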
While I understand the rationale behind this patchset, using an intermediate buffer

	void *q[XDP_BULK_QUEUE_SIZE];

means more pressure on the data cache.
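For reference, that staging buffer lives on the stack inside struct xdp_frame_bulk; as far as I can tell from the series the layout is roughly the following (paraphrased, so treat the field comments as my reading of it):

#define XDP_BULK_QUEUE_SIZE	16

struct xdp_frame_bulk {
	int count;			/* frames currently staged in q[] */
	void *xa;			/* cached mem allocator, looked up via mem->id */
	void *q[XDP_BULK_QUEUE_SIZE];	/* staged frame data pointers, released at flush */
};

That is 16 pointers (128 bytes on 64-bit) plus count/xa touched on every completion batch; the upside is that the mem->id lookup and the page_pool return get amortized over up to XDP_BULK_QUEUE_SIZE frames, which is presumably where the gain quoted above comes from.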
At the time, I ran performance tests on mlx5 to see whether batching skbs before passing them to GRO would improve performance, and on some flows it actually got worse.
This function seems to have less D-cache contention than the RX flow, but maybe some performance testing is needed here.
}
bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq)