Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb

Eric Dumazet <edumazet@xxxxxxxxxx> · Wed, 4 Dec 2024 15:16:48 +0100

On Wed, Dec 4, 2024 at 3:02 PM Alexandra Winter <wintera@xxxxxxxxxxxxx> wrote:
>
> Linearize the skb if the device uses IOMMU and the data buffer can fit
> into one page, so that a message is sent to the card in one transfer
> instead of two.
>
> Performance issue:
> ------------------
> Since commit 472c2e07eef0 ("tcp: add one skb cache for tx")
> tcp skbs are always non-linear. Especially on platforms with IOMMU,
> mapping and unmapping two pages instead of one per transfer can make a
> noticeable difference. On s390 we saw a 13% degradation in throughput
> when running uperf with a request-response pattern with 1k payload and
> 250 parallel connections. See [0] for a discussion.
>
> This patch mitigates these effects using a work-around in the mlx5 driver.
>
> Notes on implementation:
> ------------------------
> TCP skbs never contain any tailroom, so skb_linearize() will allocate a
> new data buffer.
> No need to handle rc of skb_linearize(). If it fails, we continue with the
> unchanged skb.
>
> As mentioned in the discussion, an alternative, but more invasive approach
> would be: premapping a coherent piece of memory in which you can copy
> small skbs.
>
> Measurement results:
> --------------------
> We see an improvement in throughput of up to 16% compared to kernel v6.12.
> We measured throughput and CPU consumption of uperf benchmarks with
> ConnectX-6 cards on s390 architecture and compared results of kernel v6.12
> with and without this patch.
>
> +------------------------------------------+
> | Transactions per Second - Deviation in % |
> +-------------------+----------------------+
> | Workload          |                      |
> |  rr1c-1x1--50     |          4.75        |
> |  rr1c-1x1-250     |         14.53        |
> | rr1c-200x1000--50 |          2.22        |
> | rr1c-200x1000-250 |         12.24        |
> +-------------------+----------------------+
> | Server CPU Consumption - Deviation in %  |
> +-------------------+----------------------+
> | Workload          |                      |
> |  rr1c-1x1--50     |         -1.66        |
> |  rr1c-1x1-250     |        -10.00        |
> | rr1c-200x1000--50 |         -0.83        |
> | rr1c-200x1000-250 |         -8.71        |
> +-------------------+----------------------+
>
> Note:
> - CPU consumption: less is better
> - Client CPU consumption is similar
> - Workload:
>   rr1c-<bytes sent>x<bytes received>-<parallel connections>
>
>   Highly transactional small data sizes (rr1c-1x1)
>     This is a Request & Response (RR) test that sends a 1-byte request
>     from the client and receives a 1-byte response from the server. This
>     is the smallest possible transactional workload test and is smaller
>     than most customer workloads. This test represents the RR overhead
>     costs.
>   Highly transactional medium data sizes (rr1c-200x1000)
>     Request & Response (RR) test that sends a 200-byte request from the
>     client and receives a 1000-byte response from the server. This test
>     should be representative of a typical user's interaction with a remote
>     web site.
>
> Link: https://lore.kernel.org/netdev/20220907122505.26953-1-wintera@xxxxxxxxxxxxx/#t [0]
> Suggested-by: Rahul Rameshbabu <rrameshbabu@xxxxxxxxxx>
> Signed-off-by: Alexandra Winter <wintera@xxxxxxxxxxxxx>
> Co-developed-by: Nils Hoppmann <niho@xxxxxxxxxxxxx>
> Signed-off-by: Nils Hoppmann <niho@xxxxxxxxxxxxx>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> index f8c7912abe0e..421ba6798ca7 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> @@ -32,6 +32,7 @@
>
>  #include <linux/tcp.h>
>  #include <linux/if_vlan.h>
> +#include <linux/iommu-dma.h>
>  #include <net/geneve.h>
>  #include <net/dsfield.h>
>  #include "en.h"
> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>  {
>         struct mlx5e_sq_stats *stats = sq->stats;
>
> +       /* Don't require 2 IOMMU TLB entries, if one is sufficient */
> +       if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
> +               skb_linearize(skb);
> +
>         if (skb_is_gso(skb)) {
>                 int hopbyhop;
>                 u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);
> --
> 2.45.2

Was this tested on x86_64 or any arch other than s390, especially ones
with PAGE_SIZE = 65536?
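
To make the page-size question concrete, here is a minimal userspace
sketch of the condition the patch adds. It is a model only, not kernel
code: would_linearize() and its parameters are hypothetical stand-ins for
use_dma_iommu(sq->pdev), skb->truesize and PAGE_SIZE. It only shows that
the same skb can take the copy path on a 64K-page build and the
scatter-gather path on a 4K-page build.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical userspace model of the check in the patch:
 *     use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE
 * None of these names are kernel APIs; they mirror the inputs only.
 */
bool would_linearize(bool uses_iommu, size_t truesize, size_t page_size)
{
	assert(page_size > 0);	/* a zero page size would be a caller bug */

	/* Linearize only behind an IOMMU, and only if the skb fits a page. */
	return uses_iommu && truesize <= page_size;
}
```

Under this model, would_linearize(true, 32768, 4096) is false while
would_linearize(true, 32768, 65536) is true, so on a 64K-page system the
threshold covers skbs far larger than the ~1k messages measured above.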