Hi,

MPWQE apparently causes this in mlx5. MPWQE does not provide true zero-copy for packet sizes below 256 bytes. In order not to saturate the PCIe bus, MPWQE "copies" multiple small packets into one fixed-size memory block and then sends them in one work queue entry, which explains the degrading performance from packet sizes 64B to 256B.
https://github.com/torvalds/linux/blob/5aaef24b5c6d4246b2cac1be949869fa36577737/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h#L159

MPWQE can be turned off using:

ethtool --set-priv-flags enp2s0np0 xdp_tx_mpwqe off

Performance then reaches 25M pps at 64B and 128B, and 18M pps at 256B, saturating the 40Gbps link.
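To illustrate the idea (this is only a simplified sketch with made-up names, not the driver code; the real logic is in the xdp.h link above): frames below a size threshold get memcpy'd into the WQE block itself, larger frames are posted as a DMA pointer, i.e. true zero-copy.

    /* Illustrative sketch only, not the mlx5 source. Names and the
     * 256-byte threshold are assumptions chosen to match the knee
     * seen in the measurements. */
    #include <stdint.h>
    #include <string.h>

    #define INLINE_THRESHOLD 256      /* assumed cut-off for inlining */

    struct wqe_builder {              /* hypothetical stand-in for an MPWQE session */
            uint8_t  inline_buf[4096];
            size_t   inline_used;
            uint64_t dma_addrs[64];
            uint32_t dma_lens[64];
            unsigned n_dma;
    };

    static void mpwqe_add_packet(struct wqe_builder *w, const void *data,
                                 uint64_t dma_addr, uint32_t len)
    {
            if (len <= INLINE_THRESHOLD) {
                    /* "copy" path: small frame is folded into the
                     * multi-packet WQE, so the NIC fetches one
                     * contiguous block over PCIe */
                    memcpy(w->inline_buf + w->inline_used, data, len);
                    w->inline_used += len;
            } else {
                    /* zero-copy path: only a descriptor
                     * (address + length) is posted */
                    w->dma_addrs[w->n_dma] = dma_addr;
                    w->dma_lens[w->n_dma]  = len;
                    w->n_dma++;
            }
    }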
Best,
Jalal

On 10/26/22 12:34, Mostafa, Jalal (IPE) wrote:
Hi,

I am running the txonly microbenchmark from the AF_XDP-example sample program in bpf-examples. However, I noticed strange performance with different tx packet sizes. With the following command I generate the results below:

./xdpsock -i enp2s0np0 -q 4 -t -b 2048 -m -s 128 -m

Driver: mlx5 on 40Gbps ConnectX-6
Kernel version: 6.0.0-rc6+

| Pkt Size (B) | Native-ZC PPS | Native-C PPS | Generic PPS |
| ------------ | ------------- | ------------ | ----------- |
| 64           | 16.5M         | 1.73M        | 1.71M       |
| 128          | 9.42M         | 1.72M        | 1.66M       |
| 256          | 7.78M         | 1.64M        | 1.66M       |
| 512          | 9.39M         | 1.62M        | 1.59M       |
| 1024         | 4.78M         | 1.42M        | 1.38M       |

At size 128B, I expect 16.5M pps (the limiting performance of AF_XDP) since the link is not saturated. The problem is more obvious at size 256B: only 7.78M pps, even though it jumps back up to 9.39M at 512B. So I think the problem is related to the packet size and not to limited performance of the xsk engine in the kernel. Or what do you think?

I have already opened an issue on GitHub here: https://github.com/xdp-project/bpf-examples/issues/61

Best,
Jalal
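For reference, the txonly hot path boils down to a loop like the sketch below (a minimal sketch, not the xdpsock source; it assumes a umem and xsk socket already set up with libxdp's xsk_* helpers, FRAME_SIZE-sized umem frames pre-filled with the test packet, and completion-ring handling omitted). The point is that the -s packet size only changes desc->len; the xsk path does the same work per descriptor regardless, which is why the 256B drop points at the driver rather than the xsk engine.

    #include <sys/socket.h>
    #include <linux/if_xdp.h>
    #include <xdp/xsk.h>         /* libxdp; older setups use <bpf/xsk.h> */

    #define FRAME_SIZE 2048      /* assumed umem frame size */

    static void tx_batch(struct xsk_socket *xsk, struct xsk_ring_prod *tx,
                         unsigned int batch, unsigned int pkt_size)
    {
            __u32 idx;

            /* Reserve descriptors in the TX ring; bail out if it is full. */
            if (xsk_ring_prod__reserve(tx, batch, &idx) != batch)
                    return;

            for (unsigned int i = 0; i < batch; i++) {
                    struct xdp_desc *desc = xsk_ring_prod__tx_desc(tx, idx + i);

                    /* Only the descriptor changes with -s: addr points into
                     * the umem and len carries the packet size. The driver
                     * then decides whether that frame is DMA'd zero-copy or
                     * copied (e.g. MPWQE inlining). Frame addressing here is
                     * simplified for the sketch. */
                    desc->addr = (__u64)(idx + i) * FRAME_SIZE;
                    desc->len  = pkt_size;
            }

            xsk_ring_prod__submit(tx, batch);

            /* Kick the kernel if the driver asked for a wakeup. */
            if (xsk_ring_prod__needs_wakeup(tx))
                    sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
    }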