On 08.09.22 14:41, Eric Dumazet wrote:
> On Thu, Sep 8, 2022 at 2:40 AM Christian Borntraeger
> <borntraeger@xxxxxxxxxxxxx> wrote:
>>
>> On 07.09.22 at 18:06, Eric Dumazet wrote:
>>> On Wed, Sep 7, 2022 at 5:26 AM Alexandra Winter <wintera@xxxxxxxxxxxxx> wrote:
>>>>
>>>> Since linear payload was removed even for single small messages,
>>>> an additional page is required and we are measuring a performance impact.
>>>>
>>>> 3613b3dbd1ad ("tcp: prepare skbs for better sack shifting")
>>>> explicitly allowed "payload in skb->head for first skb put in the queue,
>>>> to not impact RPC workloads."
>>>> 472c2e07eef0 ("tcp: add one skb cache for tx")
>>>> made that obsolete and removed it.
>>>> When d8b81175e412 ("tcp: remove sk_{tr}x_skb_cache")
>>>> reverted that change, this piece was not reverted and not added back in.
>>>>
>>>> When running uperf with a request-response pattern with 1k payload
>>>> and 250 parallel connections, we measure a 13% difference in throughput
>>>> for our PCI-based network interfaces since 472c2e07eef0.
>>>> (Our IOMMU is sensitive to the number of mapped pages.)
>>>
>>>> Could you please consider allowing linear payload for the first
>>>> skb in the queue again? A patch proposal is appended below.
>>>
>>> No.
>>>
>>> Please add a workaround in your driver.
>>>
>>> You can increase throughput by 20% by premapping a coherent piece of
>>> memory into which you can copy small skbs (skb->head included).
>>>
>>> Something like 256 bytes per slot in the TX ring.
>>>
>> FWIW, this regression was with the standard Mellanox driver (nothing s390 specific).
>
> I did not claim this was s390 specific.
>
> Only IOMMU mode.
>
> I would rather not add back something which makes the TCP stack slower
> (more tests in the fast path) for the majority of us _not_ using an IOMMU.
>
> In our own tests, this trick of using linear skbs was only helping
> benchmarks, not real workloads.
>
> Many drivers have to map skb->head a second time if it contains TCP payload,
> thus adding yet another corner case to their fast path.
>
> - Typical RPC workloads are playing with TCP_NODELAY
> - Typical bulk flows never have empty write queues...
>
> Really, I do not want this optimization back; it is not worth it.
>
> Again, a driver knows better whether it is using an IOMMU and whether
> pathological layouts can be optimized to non-SG ones, and using a
> pre-dma-mapped zone will also benefit pure TCP ACK packets (which do not
> have any payload).
>
> Here is the changelog of a patch I did for our GQ NIC (not yet
> upstreamed, but will be soon)
[...]

Saeed,
as discussed at LPC, could you please consider adding a workaround to the
Mellanox driver to use non-SG skbs for small messages?
As mentioned above, we are seeing a 13% throughput degradation when 2 pages
need to be mapped instead of 1.
While Eric's ideas sound very promising, just using non-SG skbs in these
cases should be enough to mitigate the performance regression we see.

Thank you in advance.
Alexandra
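
For illustration, a minimal sketch of the pre-dma-mapped TX bounce area
Eric describes above. All names (tx_ring, SLOT_SIZE, tx_try_bounce) are
hypothetical; this is not the mlx5 or GQ implementation, and descriptor
posting and doorbell handling are omitted.

#include <linux/dma-mapping.h>
#include <linux/skbuff.h>

#define SLOT_SIZE 256	/* one bounce slot per TX descriptor, as suggested */

struct tx_ring {
	struct device *dev;
	void *bounce_cpu;	/* coherent bounce area, mapped once at init */
	dma_addr_t bounce_dma;
	unsigned int size;	/* number of descriptors */
};

/* Map the bounce area once, so small packets need no per-skb DMA mapping. */
static int tx_ring_bounce_init(struct tx_ring *ring)
{
	ring->bounce_cpu = dma_alloc_coherent(ring->dev,
					      ring->size * SLOT_SIZE,
					      &ring->bounce_dma, GFP_KERNEL);
	return ring->bounce_cpu ? 0 : -ENOMEM;
}

/*
 * Copy a small skb (pure ACKs included) into the premapped slot and
 * return the DMA address to put in the descriptor; 0 means "too big,
 * fall back to the regular per-skb mapping / SG path".
 */
static dma_addr_t tx_try_bounce(struct tx_ring *ring, unsigned int slot,
				struct sk_buff *skb)
{
	void *dst = ring->bounce_cpu + slot * SLOT_SIZE;

	if (skb->len > SLOT_SIZE)
		return 0;

	/* Linearizes the skb into the slot, whatever its page layout. */
	skb_copy_bits(skb, 0, dst, skb->len);
	return ring->bounce_dma + slot * SLOT_SIZE;
}

Because the area is coherent and mapped once at ring setup, the IOMMU sees
no per-packet mappings, and a single non-SG descriptor can be used no matter
how the stack laid out the skb.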