[PATCH net-next] net/mlx5e: Transmit small messages in linear skb

Alexandra Winter <wintera@xxxxxxxxxxxxx> · Wed, 4 Dec 2024 15:02:30 +0100

Linearize the skb if the device uses IOMMU and the data buffer can fit
into one page, so that such messages are sent to the card in one transfer
instead of two.

Performance issue:
------------------
Since commit 472c2e07eef0 ("tcp: add one skb cache for tx"),
TCP skbs are always non-linear. Especially on platforms with IOMMU,
mapping and unmapping two pages instead of one per transfer can make a
noticeable difference. On s390 we saw a 13% degradation in throughput
when running uperf with a request-response pattern with 1k payload and
250 parallel connections. See [0] for a discussion.

This patch mitigates these effects with a workaround in the mlx5 driver.

Notes on implementation:
------------------------
TCP skbs never contain any tailroom, so skb_linearize() will allocate a
new data buffer.
There is no need to handle the return code of skb_linearize(): if it
fails, we continue with the unchanged skb.
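
The decision described above can be illustrated with a small userspace
model. This is only a sketch, not driver code: struct model_pkt, the
model_* helpers and MODEL_PAGE_SIZE are hypothetical names, a 4 KiB page
is assumed, and the model compares the payload length against the page
size, whereas the patch tests skb->truesize.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for an skb: linear head plus one paged fragment. */
struct model_pkt {
	unsigned char *head;	/* linear part, always allocated */
	size_t head_len;
	unsigned char *frag;	/* paged fragment, NULL if packet is linear */
	size_t frag_len;
};

/* Allocate a packet; frag_len == 0 yields an already linear packet. */
static int model_pkt_init(struct model_pkt *pkt, size_t head_len,
			  size_t frag_len)
{
	pkt->head = calloc(1, head_len ? head_len : 1);
	pkt->frag = frag_len ? calloc(1, frag_len) : NULL;
	pkt->head_len = head_len;
	pkt->frag_len = frag_len;
	if (!pkt->head || (frag_len && !pkt->frag)) {
		free(pkt->head);
		free(pkt->frag);
		return -1;
	}
	return 0;
}

/* Number of separate DMA mappings the packet would need in this model. */
static unsigned int model_nr_mappings(const struct model_pkt *pkt)
{
	assert(pkt);
	return pkt->frag ? 2 : 1;
}

/*
 * Copy head and fragment into one newly allocated buffer.
 * Returns 0 on success; on allocation failure returns -1 and leaves
 * the packet unchanged, so the caller can still send it as it is.
 */
static int model_linearize(struct model_pkt *pkt)
{
	unsigned char *buf;

	assert(pkt);
	if (!pkt->frag)
		return 0;
	buf = malloc(pkt->head_len + pkt->frag_len);
	if (!buf)
		return -1;
	memcpy(buf, pkt->head, pkt->head_len);
	memcpy(buf + pkt->head_len, pkt->frag, pkt->frag_len);
	free(pkt->head);
	free(pkt->frag);
	pkt->head = buf;
	pkt->head_len += pkt->frag_len;
	pkt->frag = NULL;
	pkt->frag_len = 0;
	return 0;
}

/* Linearize small packets only when an IOMMU is in use; ignore failure. */
static void model_xmit_prepare(struct model_pkt *pkt, bool uses_iommu)
{
	if (uses_iommu && pkt->head_len + pkt->frag_len <= MODEL_PAGE_SIZE)
		(void)model_linearize(pkt);
}
```

In this model, a packet with a 64-byte head and a 1000-byte fragment
needs one mapping instead of two after model_xmit_prepare() when an
IOMMU is in use. A packet larger than one page, or any packet without
an IOMMU, is left untouched.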

As mentioned in the discussion, an alternative but more invasive approach
would be to premap a coherent piece of memory into which small skbs can
be copied.

Measurement results:
--------------------
We see an improvement in throughput of up to 16% compared to kernel v6.12.
We measured throughput and CPU consumption of uperf benchmarks with
ConnectX-6 cards on s390 architecture and compared results of kernel v6.12
with and without this patch.

+------------------------------------------+
| Transactions per Second - Deviation in % |
+-------------------+----------------------+
| Workload          |                      |
|  rr1c-1x1--50     |          4.75        |
|  rr1c-1x1-250     |         14.53        |
| rr1c-200x1000--50 |          2.22        |
| rr1c-200x1000-250 |         12.24        |
+-------------------+----------------------+
| Server CPU Consumption - Deviation in %  |
+-------------------+----------------------+
| Workload          |                      |
|  rr1c-1x1--50     |         -1.66        |
|  rr1c-1x1-250     |        -10.00        |
| rr1c-200x1000--50 |         -0.83        |
| rr1c-200x1000-250 |         -8.71        |
+-------------------+----------------------+

Notes:
- CPU consumption: lower is better
- Client CPU consumption is similar
- Workload:
  rr1c-<bytes sent>x<bytes received>-<parallel connections>

  Highly transactional small data sizes (rr1c-1x1)
    This is a Request & Response (RR) test that sends a 1-byte request
    from the client and receives a 1-byte response from the server. This
    is the smallest possible transactional workload test and is smaller
    than most customer workloads. This test represents the RR overhead
    costs.
  Highly transactional medium data sizes (rr1c-200x1000)
    Request & Response (RR) test that sends a 200-byte request from the
    client and receives a 1000-byte response from the server. This test
    should be representative of a typical user's interaction with a remote
    web site.
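
For readers scripting around these results, the workload naming scheme
above could be decoded as sketched below. struct rr_workload and
rr_workload_parse() are hypothetical helpers, not part of uperf or of
this patch. The sketch assumes that the extra hyphen in names such as
"rr1c-1x1--50" is only padding for alignment.

```c
#include <stdio.h>

/* Fields encoded in a workload name (hypothetical helper type). */
struct rr_workload {
	unsigned int bytes_sent;	/* request size in bytes */
	unsigned int bytes_received;	/* response size in bytes */
	unsigned int connections;	/* parallel connections */
};

/*
 * Parse "rr1c-<bytes sent>x<bytes received>-<parallel connections>".
 * Returns 0 on success, -1 if the name does not follow the scheme.
 */
static int rr_workload_parse(const char *name, struct rr_workload *w)
{
	int consumed = 0;
	const char *p;

	if (sscanf(name, "rr1c-%ux%u%n", &w->bytes_sent,
		   &w->bytes_received, &consumed) != 2)
		return -1;

	p = name + consumed;
	if (*p != '-')
		return -1;
	while (*p == '-')	/* assumption: repeated '-' is padding */
		p++;

	if (sscanf(p, "%u%n", &w->connections, &consumed) != 1 ||
	    p[consumed] != '\0')
		return -1;
	return 0;
}
```

With this sketch, "rr1c-200x1000--50" decodes to a 200-byte request, a
1000-byte response and 50 parallel connections.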

Link: https://lore.kernel.org/netdev/20220907122505.26953-1-wintera@xxxxxxxxxxxxx/#t [0]
Suggested-by: Rahul Rameshbabu <rrameshbabu@xxxxxxxxxx>
Signed-off-by: Alexandra Winter <wintera@xxxxxxxxxxxxx>
Co-developed-by: Nils Hoppmann <niho@xxxxxxxxxxxxx>
Signed-off-by: Nils Hoppmann <niho@xxxxxxxxxxxxx>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index f8c7912abe0e..421ba6798ca7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -32,6 +32,7 @@
 
 #include <linux/tcp.h>
 #include <linux/if_vlan.h>
+#include <linux/iommu-dma.h>
 #include <net/geneve.h>
 #include <net/dsfield.h>
 #include "en.h"
@@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 {
 	struct mlx5e_sq_stats *stats = sq->stats;
 
+	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
+	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
+		skb_linearize(skb);
+
 	if (skb_is_gso(skb)) {
 		int hopbyhop;
 		u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);
-- 
2.45.2