On Wed, Dec 4, 2024 at 3:16 PM Eric Dumazet <edumazet@xxxxxxxxxx> wrote:
>
> On Wed, Dec 4, 2024 at 3:02 PM Alexandra Winter <wintera@xxxxxxxxxxxxx> wrote:
> >
> > Linearize the skb if the device uses IOMMU and the data buffer can fit
> > into one page. So messages can be transferred in one transfer to the card
> > instead of two.
> >
> > Performance issue:
> > ------------------
> > Since commit 472c2e07eef0 ("tcp: add one skb cache for tx")
> > tcp skbs are always non-linear. Especially on platforms with IOMMU,
> > mapping and unmapping two pages instead of one per transfer can make a
> > noticeable difference. On s390 we saw a 13% degradation in throughput,
> > when running uperf with a request-response pattern with 1k payload and
> > 250 connections parallel. See [0] for a discussion.
> >
> > This patch mitigates these effects using a work-around in the mlx5 driver.
> >
> > Notes on implementation:
> > ------------------------
> > TCP skbs never contain any tailroom, so skb_linearize() will allocate a
> > new data buffer.
> > No need to handle rc of skb_linearize(). If it fails, we continue with the
> > unchanged skb.
> >
> > As mentioned in the discussion, an alternative, but more invasive approach
> > would be: premapping a coherent piece of memory in which you can copy
> > small skbs.
> >
> > Measurement results:
> > --------------------
> > We see an improvement in throughput of up to 16% compared to kernel v6.12.
> > We measured throughput and CPU consumption of uperf benchmarks with
> > ConnectX-6 cards on s390 architecture and compared results of kernel v6.12
> > with and without this patch.
> >
> > +------------------------------------------+
> > | Transactions per Second - Deviation in % |
> > +-------------------+----------------------+
> > | Workload          |                      |
> > | rr1c-1x1--50      |                 4.75 |
> > | rr1c-1x1-250      |                14.53 |
> > | rr1c-200x1000--50 |                 2.22 |
> > | rr1c-200x1000-250 |                12.24 |
> > +-------------------+----------------------+
> > | Server CPU Consumption - Deviation in %  |
> > +-------------------+----------------------+
> > | Workload          |                      |
> > | rr1c-1x1--50      |                -1.66 |
> > | rr1c-1x1-250      |               -10.00 |
> > | rr1c-200x1000--50 |                -0.83 |
> > | rr1c-200x1000-250 |                -8.71 |
> > +-------------------+----------------------+
> >
> > Note:
> > - CPU consumption: less is better
> > - Client CPU consumption is similar
> > - Workload:
> >   rr1c-<bytes send>x<bytes received>-<parallel connections>
> >
> >   Highly transactional small data sizes (rr1c-1x1)
> >     This is a Request & Response (RR) test that sends a 1-byte request
> >     from the client and receives a 1-byte response from the server. This
> >     is the smallest possible transactional workload test and is smaller
> >     than most customer workloads. This test represents the RR overhead
> >     costs.
> >   Highly transactional medium data sizes (rr1c-200x1000)
> >     Request & Response (RR) test that sends a 200-byte request from the
> >     client and receives a 1000-byte response from the server. This test
> >     should be representative of a typical user's interaction with a remote
> >     web site.
> >
> > Link: https://lore.kernel.org/netdev/20220907122505.26953-1-wintera@xxxxxxxxxxxxx/#t [0]
> > Suggested-by: Rahul Rameshbabu <rrameshbabu@xxxxxxxxxx>
> > Signed-off-by: Alexandra Winter <wintera@xxxxxxxxxxxxx>
> > Co-developed-by: Nils Hoppmann <niho@xxxxxxxxxxxxx>
> > Signed-off-by: Nils Hoppmann <niho@xxxxxxxxxxxxx>
> > ---
> >  drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> > index f8c7912abe0e..421ba6798ca7 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> > @@ -32,6 +32,7 @@
> >
> >  #include <linux/tcp.h>
> >  #include <linux/if_vlan.h>
> > +#include <linux/iommu-dma.h>
> >  #include <net/geneve.h>
> >  #include <net/dsfield.h>
> >  #include "en.h"
> > @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> >  {
> >  	struct mlx5e_sq_stats *stats = sq->stats;
> >
> > +	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
> > +	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
> > +		skb_linearize(skb);
> > +
> >  	if (skb_is_gso(skb)) {
> >  		int hopbyhop;
> >  		u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);
> > --
> > 2.45.2
> >
> Was this tested on x86_64 or any other arch than s390, especially ones
> with PAGE_SIZE = 65536 ?

I would suggest the opposite: copy the headers (typically less than
128 bytes) into a piece of coherent memory.

As a bonus, if skb->len is smaller than 256 bytes, copy the whole skb.

Users of include/net/tso.h and net/core/tso.c do this.

Sure, the patch is going to be more invasive, but all arches will win.
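
To make the suggestion concrete, here is a rough userspace model of the copy decision, not driver code. It assumes a hypothetical per-descriptor slot of SLOT_SIZE bytes that a driver would carve out of memory mapped once up front (presumably via something like dma_alloc_coherent()); tx_prepare(), struct tx_plan, and both constants are names made up for this sketch. Only the 256-byte cut-off comes from the mail.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* From the mail: packets smaller than this are copied in full. */
#define COPY_ALL_THRESHOLD 256
/* Hypothetical size of one pre-mapped slot per TX descriptor. */
#define SLOT_SIZE 256

/* Outcome of preparing one packet for transmit (illustrative only). */
struct tx_plan {
	size_t copied;	/* bytes copied into the pre-mapped slot */
	size_t mapped;	/* bytes that still need a per-packet DMA mapping */
};

/*
 * Copy either the whole packet (small packets) or only its headers
 * into the pre-mapped slot. Whatever is not copied would still have
 * to be DMA-mapped per packet, as the driver does today.
 */
static struct tx_plan tx_prepare(unsigned char *slot,
				 const unsigned char *pkt,
				 size_t len, size_t hdr_len)
{
	struct tx_plan plan;
	size_t n;

	assert(hdr_len <= len && hdr_len <= SLOT_SIZE);

	/* Small packet: copy everything, so nothing is mapped at all. */
	n = (len < COPY_ALL_THRESHOLD) ? len : hdr_len;
	memcpy(slot, pkt, n);

	plan.copied = n;
	plan.mapped = len - n;
	return plan;
}
```

With these assumptions a 200-byte packet needs no IOMMU map/unmap at all, while a 1500-byte packet with 54 bytes of headers still maps its 1446 payload bytes but no longer maps the header area separately.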