Linearize the skb if the device uses IOMMU and the data buffer can fit
into one page, so that messages can be transferred to the card in one
transfer instead of two.

Performance issue:
------------------
Since commit 472c2e07eef0 ("tcp: add one skb cache for tx") tcp skbs are
always non-linear. Especially on platforms with IOMMU, mapping and
unmapping two pages instead of one per transfer can make a noticeable
difference. On s390 we saw a 13% degradation in throughput when running
uperf with a request-response pattern with 1k payload and 250 parallel
connections. See [0] for a discussion.

This patch mitigates these effects using a work-around in the mlx5 driver.

Notes on implementation:
------------------------
TCP skbs never contain any tailroom, so skb_linearize() will allocate a
new data buffer.
There is no need to handle the return code of skb_linearize(). If it
fails, we continue with the unchanged skb.

As mentioned in the discussion, an alternative but more invasive approach
would be to premap a coherent piece of memory into which small skbs can
be copied.

Measurement results:
--------------------
We see an improvement in throughput of up to 16% compared to kernel v6.12.
We measured throughput and CPU consumption of uperf benchmarks with
ConnectX-6 cards on s390 architecture and compared results of kernel v6.12
with and without this patch.

+------------------------------------------+
| Transactions per Second - Deviation in % |
+-------------------+----------------------+
| Workload          |                      |
| rr1c-1x1--50      |                 4.75 |
| rr1c-1x1-250      |                14.53 |
| rr1c-200x1000--50 |                 2.22 |
| rr1c-200x1000-250 |                12.24 |
+-------------------+----------------------+
| Server CPU Consumption - Deviation in %  |
+-------------------+----------------------+
| Workload          |                      |
| rr1c-1x1--50      |                -1.66 |
| rr1c-1x1-250      |               -10.00 |
| rr1c-200x1000--50 |                -0.83 |
| rr1c-200x1000-250 |                -8.71 |
+-------------------+----------------------+

Note:
- CPU consumption: less is better
- Client CPU consumption is similar
- Workload:
  rr1c-<bytes sent>x<bytes received>-<parallel connections>

  Highly transactional small data sizes (rr1c-1x1)
    This is a Request & Response (RR) test that sends a 1-byte request
    from the client and receives a 1-byte response from the server. This
    is the smallest possible transactional workload test and is smaller
    than most customer workloads. This test represents the RR overhead
    costs.
  Highly transactional medium data sizes (rr1c-200x1000)
    Request & Response (RR) test that sends a 200-byte request from the
    client and receives a 1000-byte response from the server. This test
    should be representative of a typical user's interaction with a
    remote web site.

Link: https://lore.kernel.org/netdev/20220907122505.26953-1-wintera@xxxxxxxxxxxxx/#t [0]
Suggested-by: Rahul Rameshbabu <rrameshbabu@xxxxxxxxxx>
Signed-off-by: Alexandra Winter <wintera@xxxxxxxxxxxxx>
Co-developed-by: Nils Hoppmann <niho@xxxxxxxxxxxxx>
Signed-off-by: Nils Hoppmann <niho@xxxxxxxxxxxxx>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index f8c7912abe0e..421ba6798ca7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -32,6 +32,7 @@
 
 #include <linux/tcp.h>
 #include <linux/if_vlan.h>
+#include <linux/iommu-dma.h>
 #include <net/geneve.h>
 #include <net/dsfield.h>
 #include "en.h"
@@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 {
 	struct mlx5e_sq_stats *stats = sq->stats;
 
+	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
+	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
+		skb_linearize(skb);
+
 	if (skb_is_gso(skb)) {
 		int hopbyhop;
 		u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);
-- 
2.45.2
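
For readers who want to experiment with the idea outside the kernel, the
following is a minimal userspace sketch of the decision the patch adds. It
is not kernel code. struct model_skb, MODEL_PAGE_SIZE and all model_*
functions are hypothetical stand-ins invented for this illustration, and
the mapping count is a simplification that assumes one DMA mapping for the
linear part plus one per fragment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u	/* assumed page size, for illustration only */
#define MODEL_MAX_FRAGS 4

/* Hypothetical stand-in for struct sk_buff: linear head plus fragments. */
struct model_skb {
	unsigned char *head;		/* heap-allocated linear buffer, or NULL */
	size_t head_len;		/* bytes in the linear part */
	const unsigned char *frags[MODEL_MAX_FRAGS];
	size_t frag_len[MODEL_MAX_FRAGS];
	size_t nr_frags;
	size_t truesize;		/* memory accounted to the packet */
};

/* Simplified cost model: one mapping for the head, one per fragment. */
static size_t model_dma_mappings(const struct model_skb *skb)
{
	return (skb->head_len ? 1 : 0) + skb->nr_frags;
}

/*
 * Copy all fragments into one newly allocated linear buffer.
 * Returns 0 on success, -1 on allocation failure. On failure the
 * skb is left unchanged, mirroring the behaviour the patch relies on.
 */
static int model_linearize(struct model_skb *skb)
{
	size_t total = skb->head_len;
	size_t off, i;
	unsigned char *buf;

	assert(skb->nr_frags <= MODEL_MAX_FRAGS);
	for (i = 0; i < skb->nr_frags; i++)
		total += skb->frag_len[i];

	buf = malloc(total ? total : 1);
	if (!buf)
		return -1;

	if (skb->head_len)
		memcpy(buf, skb->head, skb->head_len);
	off = skb->head_len;
	for (i = 0; i < skb->nr_frags; i++) {
		memcpy(buf + off, skb->frags[i], skb->frag_len[i]);
		off += skb->frag_len[i];
	}

	free(skb->head);
	skb->head = buf;
	skb->head_len = total;
	skb->nr_frags = 0;
	return 0;
}

/* Same condition as the patch: IOMMU in use and packet fits in a page. */
static void model_xmit_prepare(struct model_skb *skb, bool uses_iommu)
{
	if (uses_iommu && skb->truesize <= MODEL_PAGE_SIZE)
		(void)model_linearize(skb);	/* failure: keep skb as is */
}
```

In this model, a packet with a small linear header and one fragment needs
two mappings before model_xmit_prepare() and one afterwards when
uses_iommu is true. Packets whose truesize exceeds the page size, or
devices without IOMMU, are left untouched.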