From: Alexandra Winter <wintera@xxxxxxxxxxxxx>
Date: Wed, 4 Dec 2024 15:02:30 +0100

> Linearize the skb if the device uses IOMMU and the data buffer can fit
> into one page. So messages can be transferred in one transfer to the card
> instead of two.

I'd expect this to be on the generic level, not copied over the drivers?
Not sure about PAGE_SIZE, but I never saw a NIC/driver/platform where
copying let's say 256 bytes would be slower than 2x dma_map (even with
direct DMA).

>
> Performance issue:
> ------------------
> Since commit 472c2e07eef0 ("tcp: add one skb cache for tx")
> tcp skbs are always non-linear. Especially on platforms with IOMMU,
> mapping and unmapping two pages instead of one per transfer can make a
> noticeable difference. On s390 we saw a 13% degradation in throughput,
> when running uperf with a request-response pattern with 1k payload and
> 250 connections parallel. See [0] for a discussion.
>
> This patch mitigates these effects using a work-around in the mlx5 driver.
>
> Notes on implementation:
> ------------------------
> TCP skbs never contain any tailroom, so skb_linearize() will allocate a
> new data buffer.
> No need to handle rc of skb_linearize(). If it fails, we continue with the
> unchanged skb.
>
> As mentioned in the discussion, an alternative, but more invasive approach
> would be: premapping a coherent piece of memory in which you can copy
> small skbs.

Yes, that one would be better.

[...]

> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>  {
>  	struct mlx5e_sq_stats *stats = sq->stats;
>
> +	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
> +	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)

1. What's with the direct DMA? I believe it would benefit, too?
2. Why truesize, not something like

	if (skb->len <= some_sane_value_maybe_1k)

3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think
   it's a good idea to rely on this.
   Some test-based hardcode would be enough (i.e. threshold on which
   DMA mapping starts performing better).

> +		skb_linearize(skb);
> +
>  	if (skb_is_gso(skb)) {

BTW can't there be a case when the skb is GSO, but its truesize is
PAGE_SIZE and linearize will be way too slow (not sure it's possible,
just guessing)?

>  		int hopbyhop;
>  		u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);

Thanks,
Olek
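
For illustration only, points 2 and 3 above could be sketched as a
standalone userspace model. This is not kernel code and has not been
tried against mlx5. TX_COPYBREAK_BYTES, struct pkt_model and
should_linearize() are hypothetical names, the 1024-byte value is a
placeholder that would have to come from measurement, and skipping GSO
packets is an assumption prompted by the BTW question above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cut-off in bytes; a placeholder, to be chosen by testing
 * where DMA mapping starts to outperform copying. Deliberately not tied
 * to PAGE_SIZE. */
#define TX_COPYBREAK_BYTES 1024u

/* Userspace stand-in for the few sk_buff properties the check needs. */
struct pkt_model {
	size_t len;      /* total bytes in the packet (models skb->len) */
	size_t data_len; /* bytes held in fragments (models skb->data_len) */
	bool is_gso;     /* models the result of skb_is_gso() */
};

/* Decide whether copying the packet into one linear buffer is likely
 * cheaper than mapping its fragments separately. */
static bool should_linearize(const struct pkt_model *p)
{
	/* Nothing to gain if the packet is already linear. */
	if (p->data_len == 0)
		return false;

	/* Assumption: leave GSO packets alone to avoid large copies. */
	if (p->is_gso)
		return false;

	/* Length-based threshold instead of truesize <= PAGE_SIZE. */
	return p->len <= TX_COPYBREAK_BYTES;
}
```

The check uses the packet length rather than truesize and applies to
both IOMMU and direct DMA, since it does not look at the mapping type
at all.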