Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb

Hi Alexandra,

On 11.12.24 14:35, Alexandra Winter wrote:
> 
> 
> On 10.12.24 14:54, Alexander Lobakin wrote:
>> From: Dragos Tatulea <dtatulea@xxxxxxxxxx>
>> Date: Tue, 10 Dec 2024 12:44:04 +0100
>>
>>>
>>>
>>> On 06.12.24 16:20, Alexandra Winter wrote:
>>>>
>>>>
>>>> On 04.12.24 15:32, Alexander Lobakin wrote:
>>>>>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>>>>>  {
>>>>>>  	struct mlx5e_sq_stats *stats = sq->stats;
>>>>>>  
>>>>>> +	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
>>>>>> +	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
>>>>>> +		skb_linearize(skb);
>>>>> 1. What's with the direct DMA? I believe it would benefit, too?
>>>>
>>>>
>>>> Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact.
>>>> Any opinions from the NVIDIA people?
>>>>
>>> Agreed.
>>>
>>>>
>>>>> 2. Why truesize, not something like
>>>>>
>>>>> 	if (skb->len <= some_sane_value_maybe_1k)
>>>>
>>>>
>>>> With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page.
>>>> When we set the threshold at a smaller value, skb->len makes more sense.
>>>>
>>>>
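
As an aside, a minimal sketch contrasting the two candidate checks (the
256B value below is only illustrative, not a measured threshold):

```
/* truesize-based: skb->truesize counts the whole allocation (head
 * buffer, frag pages, struct overhead), so this bounds the memory
 * touched by the copy to one page. */
if (skb->truesize <= PAGE_SIZE)
	skb_linearize(skb);

/* len-based: skb->len counts payload bytes only, so the copybreak is
 * independent of PAGE_SIZE. */
if (skb->len <= 256)
	skb_linearize(skb);
```
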
>>>>>
>>>>> 3. As Eric mentioned, PAGE_SIZE can be up to 256 KB, so I don't think
>>>>>    it's a good idea to rely on this.
>>>>>    Some test-based hardcode would be enough (i.e. threshold on which
>>>>>    DMA mapping starts performing better).
>>>>
>>>>
>>>> A threshold of 4k is absolutely fine with us (s390).
>>>> A threshold of 1k would definitely improve our situation and bring back the performance for some important scenarios.
>>>>
>>>>
>>>> NVIDIA people, do you have any opinion on a good threshold?
>>>>
> 
> On 09.12.24 12:36, Tariq Toukan wrote:
>> Hi,
>>
>> Many approaches in the past few years have gone in the opposite direction, trying to avoid copies ("zero-copy").
>>
>> In many cases, copying up to PAGE_SIZE means copying everything.
>> For high NIC speeds this is not realistic.
>>
>> Anyway, based on past experience, the threshold should not exceed the "max header size" (128/256B).
> 
>>> 1KB is still too large. As Tariq mentioned, the threshold should not
>>> exceed 128/256B. I am currently testing this with 256B on x86. So far,
>>> no regressions, but I need to play with it more.
I checked on an x86 system with a CX7 and we seem to get ~4% degradation
when using this approach. The patch was modified a bit according to the
previous discussions (diff at the end of this mail).

Here's how I tested:
- uperf client side has many queues.
- uperf server side has a single queue, with interrupts pinned to a single
  CPU. This is to better isolate CPU behaviour. The idea is to have the
  CPU on the server side saturated or close to saturation.
- Used the uperf 1B read x 1B write scenario with 25, 50 and 100 threads
  (profile attached).
  Both the on-cpu and off-cpu cases were checked.
- Code change was done only on server side.

The results:
```
Case:                                          Throughput   Operations
----------------------------------------------------------------------
app cpu == irq cpu, nthreads= 25, baseline     3.86Mb/s     30036552  
app cpu == irq cpu, nthreads= 25, skb_linear   3.52Mb/s     26969315  

app cpu == irq cpu, nthreads= 50, baseline     4.26Mb/s     33122458 
app cpu == irq cpu, nthreads= 50, skb_linear   4.06Mb/s     31606290 

app cpu == irq cpu, nthreads=100, baseline     4.08Mb/s     31775434  
app cpu == irq cpu, nthreads=100, skb_linear   3.86Mb/s     30105582

app cpu != irq cpu, nthreads= 25, baseline     3.57Mb/s     27785488
app cpu != irq cpu, nthreads= 25, skb_linear   3.56Mb/s     27730133

app cpu != irq cpu, nthreads= 50, baseline     3.97Mb/s     30876264
app cpu != irq cpu, nthreads= 50, skb_linear   3.82Mb/s     29737654

app cpu != irq cpu, nthreads=100, baseline     4.06Mb/s     31621140
app cpu != irq cpu, nthreads=100, skb_linear   3.90Mb/s     30364382
```

So, not encouraging. I restricted the tests to 1B payloads only, as I
expected even worse performance for larger payloads.

>>
>> On different setups, a copybreak of 192 or 256 bytes was usually the
>> most efficient as well.
>>
>>>
> 
> 
> Thank you very much, everybody for looking into this.
> 
> Unfortunately, we are seeing these performance regressions with ConnectX-6 cards on s390,
> and with this patch we get up to 12% more transactions/sec, even for 1k messages.
> 
> As we're always using an IOMMU and are operating with large multi socket systems,
> DMA costs far outweigh the costs of small to medium memcpy()s on our system.
> We realize that the recommendation is to just run without IOMMU when performance is important,
> but this is not an option in our environment.
> 
> I understand that the simple patch of calling skb_linearize() in mlx5 for small messages is not a
> strategic direction; it is more a simple mitigation of our problem.
> 
> Whether it should be restricted to use_dma_iommu() or not is up to you and your measurements.
> We could even restrict it to S390 arch, if you want.
>
Maybe Tariq can comment on this.

> A threshold of 256b would cover a fair amount of our typical request-response workloads
> (think database queries/updates), but we would prefer a higher number (e.g. 1k or 2k).
> Could we maybe define an architecture-dependent threshold?
> 
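An architecture-dependent value could look something like the sketch below;
MLX5E_TX_LINEARIZE_THRESH is a made-up name and the per-arch values are
placeholders, not measured numbers:

```
/* Hypothetical per-arch copybreak; name and values are placeholders. */
#if IS_ENABLED(CONFIG_S390)
#define MLX5E_TX_LINEARIZE_THRESH	1024
#else
#define MLX5E_TX_LINEARIZE_THRESH	256
#endif

	if (skb->len <= MLX5E_TX_LINEARIZE_THRESH)
		skb_linearize(skb);
```
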
> My preferred scenario for the next steps would be the following:
> 1) It would be great if we could get a simple mitigation patch upstream, that the distros could
> easily backport. This would avoid our customers experiencing performance regression when they
> upgrade their distro versions. (e.g. from RHEL8 to RHEL9 or RHEL10 just as an example)
>
Stupid question on my part: why can't this patch be taken as a distro
patch for s390 and carried over across releases? This way the kernel
upgrade pain would be avoided.

> 2) Then we could work on a more invasive solution (see the proposals by Eric and Dragos/Saeed).
> This would then replace the simple mitigation patch upstream, and future releases would contain
> that solution. If somebody else wants to work on this improved version, fine with me; otherwise
> I could give it a try.
> 
For the inline WQE solution, I don't think we have a large amount of space
to copy so much data into. Eric's side-buffer suggestion would be even more
invasive and would touch many more code paths...
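
For reference, the existing inline path only copies the packet headers into
the ETH segment of the TX WQE, roughly like this (simplified sketch, not the
exact en_tx.c code):

```
/* Copy ihs header bytes into the WQE's ETH segment. Anything beyond
 * the built-in inline_hdr space consumes additional 16B data segments
 * of the WQE, so there is no room for a 1k-2k copybreak. */
memcpy(eseg->inline_hdr.start, skb->data, ihs);
eseg->inline_hdr.sz = cpu_to_be16(ihs);
```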

> What do you think of this approach?
> 
>
Sounds tricky. Let's see what Tariq has to say.

Thanks,
Dragos

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index f8c7912abe0e..cc947daa538b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -269,6 +269,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 {
        struct mlx5e_sq_stats *stats = sq->stats;
 
+       /* Copybreak: linearize small skbs so a single DMA mapping suffices */
+       if (skb->len < 256)
+               skb_linearize(skb);
+
        if (skb_is_gso(skb)) {
                int hopbyhop;
                u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);

<?xml version="1.0"?>
<profile name="netperf">
  <group nthreads="$nthreads">
        <transaction iterations="$iter">
            <flowop type="accept" options="remotehost=$h1 protocol=$proto tcp_nodelay"/>
        </transaction>
        <transaction duration="$dur">
            <flowop type="read" options="size=1"/>
            <flowop type="write" options="size=1"/>
        </transaction>
        <transaction iterations="$iter">
            <flowop type="disconnect" />
        </transaction>
  </group>
</profile>
