On 10.12.24 14:54, Alexander Lobakin wrote: > From: Dragos Tatulea <dtatulea@xxxxxxxxxx> > Date: Tue, 10 Dec 2024 12:44:04 +0100 > >> >> >> On 06.12.24 16:20, Alexandra Winter wrote: >>> >>> >>> On 04.12.24 15:32, Alexander Lobakin wrote: >>>>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb, >>>>> { >>>>> struct mlx5e_sq_stats *stats = sq->stats; >>>>> >>>>> + /* Don't require 2 IOMMU TLB entries, if one is sufficient */ >>>>> + if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE) >>> + skb_linearize(skb); >>>> 1. What's with the direct DMA? I believe it would benefit, too? >>> >>> >>> Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact. >>> Any opinions from the NVidia people? >>> >> Agreed. >> >>> >>>> 2. Why truesize, not something like >>>> >>>> if (skb->len <= some_sane_value_maybe_1k) >>> >>> >>> With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page. >>> When we set the threshhold at a smaller value, skb->len makes more sense >>> >>> >>>> >>>> 3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think >>>> it's a good idea to rely on this. >>>> Some test-based hardcode would be enough (i.e. threshold on which >>>> DMA mapping starts performing better). >>> >>> >>> A threshhold of 4k is absolutely fine with us (s390). >>> A threshhold of 1k would definitvely improve our situation and bring back the performance for some important scenarios. >>> >>> >>> NVidia people do you have any opinion on a good threshhold? >>> On 09.12.24 12:36, Tariq Toukan wrote: > Hi, > > Many approaches in the past few years are going the opposite direction, trying to avoid copies ("zero-copy"). > > In many cases, copy up to PAGE_SIZE means copy everything. > For high NIC speeds this is not realistic. > > Anyway, based on past experience, threshold should not exceed "max header size" (128/256b). >> 1KB is still to large. As Tariq mentioned, the threshold should not >> exceed 128/256B. I am currently testing this with 256B on x86. So far no >> regressions but I need to play with it more. > > On different setups, usually the copybreak of 192 or 256 bytes was the > most efficient as well. > >> Thank you very much, everybody for looking into this. Unfortunately we are seeing these performance regressions with ConnectX-6 cards on s390 and with this patch we get up to 12% more transactions/sec even for 1k messages. As we're always using an IOMMU and are operating with large multi socket systems, DMA costs far outweigh the costs of small to medium memcpy()s on our system. We realize that the recommendation is to just run without IOMMU when performance is important, but this is not an option in our environment. I understand that the simple patch of calling skb_linearize() in mlx5 for small messages is not a strategic direction, it is more a simple mitigation of our problem. Whether it should be restricted to use_dma_iommu() or not is up to you and your measurements. We could even restrict it to S390 arch, if you want. A threshold of 256b would cover some amount of our typical request-response workloads (think database queries/updates), but we would prefer a higher number (e.g. 1k or 2k). Could we maybe define an architecture dependent threshold? My preferred scenario for the next steps would be the following: 1) It would be great if we could get a simple mitigation patch upstream, that the distros could easily backport. This would avoid our customers experiencing performance regression when they upgrade their distro versions. (e.g. from RHEL8 to RHEL9 or RHEL10 just as an example) 2) Then we could work on a more invasive solution. (see proposals by Eric, Dragos/Saeed). This would then replace the simple mitigation patch upstream, and future releases would contain that solution. If somebody else wants to work on this improved version, fine with me, otherwise I could give it a try. What do you think of this approach?