Hi Alexandra,

On 11.12.24 14:35, Alexandra Winter wrote:
>
>
> On 10.12.24 14:54, Alexander Lobakin wrote:
>> From: Dragos Tatulea <dtatulea@xxxxxxxxxx>
>> Date: Tue, 10 Dec 2024 12:44:04 +0100
>>
>>>
>>>
>>> On 06.12.24 16:20, Alexandra Winter wrote:
>>>>
>>>>
>>>> On 04.12.24 15:32, Alexander Lobakin wrote:
>>>>>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>>>>>  {
>>>>>>  	struct mlx5e_sq_stats *stats = sq->stats;
>>>>>>
>>>>>> +	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
>>>>>> +	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
>>>>>> +		skb_linearize(skb);
>>>>> 1. What's with the direct DMA? I believe it would benefit, too?
>>>>
>>>>
>>>> Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact.
>>>> Any opinions from the NVidia people?
>>>>
>>> Agreed.
>>>
>>>>
>>>>> 2. Why truesize, not something like
>>>>>
>>>>> 	if (skb->len <= some_sane_value_maybe_1k)
>>>>
>>>>
>>>> With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page.
>>>> When we set the threshold at a smaller value, skb->len makes more sense.
>>>>
>>>>
>>>>>
>>>>> 3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think
>>>>> it's a good idea to rely on this.
>>>>> Some test-based hardcode would be enough (i.e. threshold on which
>>>>> DMA mapping starts performing better).
>>>>
>>>>
>>>> A threshold of 4k is absolutely fine with us (s390).
>>>> A threshold of 1k would definitely improve our situation and bring back the performance for some important scenarios.
>>>>
>>>>
>>>> NVidia people, do you have any opinion on a good threshold?
>>>>
>
> On 09.12.24 12:36, Tariq Toukan wrote:
>> Hi,
>>
>> Many approaches in the past few years are going the opposite direction, trying to avoid copies ("zero-copy").
>>
>> In many cases, copy up to PAGE_SIZE means copy everything.
>> For high NIC speeds this is not realistic.
>>
>> Anyway, based on past experience, threshold should not exceed "max header size" (128/256b).
>
>>> 1KB is still too large. As Tariq mentioned, the threshold should not
>>> exceed 128/256B. I am currently testing this with 256B on x86. So far no
>>> regressions, but I need to play with it more.

I checked on an x86 system with a CX7 and we seem to get ~4% degradation when
using this approach. The patch was modified a bit according to previous
discussions (diff at end of mail).

Here's how I tested:
- The uperf client side has many queues.
- The uperf server side has a single queue, with interrupts pinned to a single
  CPU. This is to better isolate CPU behaviour. The idea is to have the CPU on
  the server side saturated or close to saturation.
- Used the uperf 1B read x 1B write scenario with 25, 50 and 100 threads
  (profile attached). Both the on-cpu and off-cpu cases were checked.
- The code change was done only on the server side.
The results:

```
Case:                                         Throughput   Operations
----------------------------------------------------------------------
app cpu == irq cpu, nthreads= 25, baseline     3.86Mb/s     30036552
app cpu == irq cpu, nthreads= 25, skb_linear   3.52Mb/s     26969315
app cpu == irq cpu, nthreads= 50, baseline     4.26Mb/s     33122458
app cpu == irq cpu, nthreads= 50, skb_linear   4.06Mb/s     31606290
app cpu == irq cpu, nthreads=100, baseline     4.08Mb/s     31775434
app cpu == irq cpu, nthreads=100, skb_linear   3.86Mb/s     30105582
app cpu != irq cpu, nthreads= 25, baseline     3.57Mb/s     27785488
app cpu != irq cpu, nthreads= 25, skb_linear   3.56Mb/s     27730133
app cpu != irq cpu, nthreads= 50, baseline     3.97Mb/s     30876264
app cpu != irq cpu, nthreads= 50, skb_linear   3.82Mb/s     29737654
app cpu != irq cpu, nthreads=100, baseline     4.06Mb/s     31621140
app cpu != irq cpu, nthreads=100, skb_linear   3.90Mb/s     30364382
```

So not encouraging. I restricted the tests to 1B payloads only, as I expected
worse performance for larger payloads.

>>
>> On different setups, usually the copybreak of 192 or 256 bytes was the
>> most efficient as well.
>>
>>>
>
>
> Thank you very much, everybody, for looking into this.
>
> Unfortunately we are seeing these performance regressions with ConnectX-6 cards on s390,
> and with this patch we get up to 12% more transactions/sec, even for 1k messages.
>
> As we're always using an IOMMU and are operating with large multi-socket systems,
> DMA costs far outweigh the costs of small to medium memcpy()s on our system.
> We realize that the recommendation is to just run without an IOMMU when performance is important,
> but this is not an option in our environment.
>
> I understand that the simple patch of calling skb_linearize() in mlx5 for small messages is not a
> strategic direction, it is more a simple mitigation of our problem.
>
> Whether it should be restricted to use_dma_iommu() or not is up to you and your measurements.
> We could even restrict it to the s390 arch, if you want.
>
Maybe Tariq can comment on this.

> A threshold of 256b would cover some amount of our typical request-response workloads
> (think database queries/updates), but we would prefer a higher number (e.g. 1k or 2k).
> Could we maybe define an architecture-dependent threshold?
>
> My preferred scenario for the next steps would be the following:
> 1) It would be great if we could get a simple mitigation patch upstream, that the distros could
> easily backport. This would avoid our customers experiencing performance regressions when they
> upgrade their distro versions. (e.g. from RHEL8 to RHEL9 or RHEL10, just as an example)
>
Stupid question on my behalf: why can't this patch be taken as a distro patch
for s390 and carried over across releases? This way the kernel upgrade pain
would be avoided.

> 2) Then we could work on a more invasive solution. (see proposals by Eric, Dragos/Saeed).
> This would then replace the simple mitigation patch upstream, and future releases would contain
> that solution. If somebody else wants to work on this improved version, fine with me, otherwise
> I could give it a try.
>
For the inline WQE solution I don't think we have a large amount of space to
copy so much data into. For Eric's side-buffer suggestion: it will be even more
invasive and will touch many more code paths (rough sketch below)...

> What do you think of this approach?
>
Sounds tricky. Let's see what Tariq has to say.
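To make "invasive" concrete, a side-buffer path would have to look roughly like
the sketch below. This is a minimal illustration only, not mlx5 code: struct
tx_bounce, tx_try_copybreak() and TX_COPYBREAK are all made-up names, and a
real implementation would also need per-descriptor buffer management and
completion handling.

```c
#include <linux/skbuff.h>
#include <linux/dma-mapping.h>

/* Hypothetical per-ring side ("bounce") buffer: mapped once at ring
 * setup, reused for small packets so the device sees one contiguous
 * region instead of one mapping per fragment.
 */
struct tx_bounce {
	void		*buf;	/* TX_COPYBREAK bytes, preallocated */
	dma_addr_t	dma;	/* dma_map_single()'d at ring setup */
};

#define TX_COPYBREAK	256	/* threshold discussed above */

static bool tx_try_copybreak(struct device *dev, struct tx_bounce *b,
			     struct sk_buff *skb)
{
	if (skb->len > TX_COPYBREAK)
		return false;

	/* Gather the header and all fragments into the side buffer. */
	if (skb_copy_bits(skb, 0, b->buf, skb->len))
		return false;

	/* Make the CPU writes visible to the device before posting. */
	dma_sync_single_for_device(dev, b->dma, skb->len, DMA_TO_DEVICE);

	/* The caller would now post a single descriptor pointing at
	 * (b->dma, skb->len) instead of walking skb_shinfo(skb)->frags,
	 * and must not recycle the buffer until TX completion.
	 */
	return true;
}
```

Even this stripped-down version shows where the invasiveness comes from:
descriptor posting, TX completion and buffer recycling all need to distinguish
copied packets from mapped ones.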
Thanks,
Dragos

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index f8c7912abe0e..cc947daa538b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -269,6 +269,9 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 {
 	struct mlx5e_sq_stats *stats = sq->stats;
 
+	if (skb->len < 256)
+		skb_linearize(skb);
+
 	if (skb_is_gso(skb)) {
 		int hopbyhop;
 		u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);
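Picking up Alexandra's architecture-dependent threshold question: the diff
above could be varied along the following lines. Untested sketch;
MLX5E_TX_LINEARIZE_MAX is a made-up name, and the values are just the ones
mentioned in this thread (256B generic, higher on s390).

```c
/* Hypothetical variation of the diff above: a generic 256B copybreak,
 * overridable per architecture. On s390 the IOMMU mapping cost
 * dominates, so a larger threshold pays off there.
 */
#if defined(CONFIG_S390)
#define MLX5E_TX_LINEARIZE_MAX	1024
#else
#define MLX5E_TX_LINEARIZE_MAX	256
#endif

	/* In mlx5e_sq_xmit_prepare(): collapse small fragmented skbs into
	 * the head buffer so a single DMA mapping suffices. skb_linearize()
	 * may fail on allocation; the skb is then sent fragmented as before.
	 */
	if (skb->len < MLX5E_TX_LINEARIZE_MAX && skb_is_nonlinear(skb))
		skb_linearize(skb);
```

If direct-DMA setups should stay untouched, the condition could additionally
be gated on use_dma_iommu(sq->pdev), as in the first version of the patch.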
<?xml version="1.0"?>
<profile name="netperf">
  <group nthreads="$nthreads">
    <transaction iterations="$iter">
      <flowop type="accept" options="remotehost=$h1 protocol=$proto tcp_nodelay"/>
    </transaction>
    <transaction duration="$dur">
      <flowop type="read" options="size=1"/>
      <flowop type="write" options="size=1"/>
    </transaction>
    <transaction iterations="$iter">
      <flowop type="disconnect" />
    </transaction>
  </group>
</profile>