Re: [RFC net] tcp: Fix performance regression for request-response workloads

Eric Dumazet <edumazet@xxxxxxxxxx> · Thu, 8 Sep 2022 05:41:12 -0700

On Thu, Sep 8, 2022 at 2:40 AM Christian Borntraeger
<borntraeger@xxxxxxxxxxxxx> wrote:
>
> Am 07.09.22 um 18:06 schrieb Eric Dumazet:
> > On Wed, Sep 7, 2022 at 5:26 AM Alexandra Winter <wintera@xxxxxxxxxxxxx> wrote:
> >>
> >> Since linear payload was removed even for single small messages,
> >> an additional page is required and we are measuring performance impact.
> >>
> >> 3613b3dbd1ad ("tcp: prepare skbs for better sack shifting")
> >> explicitely allowed "payload in skb->head for first skb put in the queue,
> >> to not impact RPC workloads."
> >> 472c2e07eef0 ("tcp: add one skb cache for tx")
> >> made that obsolete and removed it.
> >> When
> >> d8b81175e412 ("tcp: remove sk_{tr}x_skb_cache")
> >> reverted it, this piece was not reverted and not added back in.
> >>
> >> When running uperf with a request-response pattern with 1k payload
> >> and 250 connections parallel, we measure 13% difference in throughput
> >> for our PCI based network interfaces since 472c2e07eef0.
> >> (our IO MMU is sensitive to the number of mapped pages)
> >
> >
> >
> >>
> >> Could you please consider allowing linear payload for the first
> >> skb in queue again? A patch proposal is appended below.
> >
> > No.
> >
> > Please add a work around in your driver.
> >
> > You can increase throughput by 20% by premapping a coherent piece of
> > memory in which
> > you can copy small skbs (skb->head included)
> >
> > Something like 256 bytes per slot in the TX ring.
> >
>
> FWIW this regression was withthe standard mellanox driver (nothing s390 specific).

I did not claim this was s390 specific.

Only IOMMU mode.

I would rather not add back something which makes TCP stack slower
(more tests in fast path)
for the majority of us _not_ using IOMMU.

In our own tests, this trick of using linear skbs was only helping
benchmarks, not real workloads.

Many drivers have to map skb->head a second time if they contain TCP payload,
thus adding yet another corner case in their fast path.

- Typical RPC workloads are playing with TCP_NODELAY
- Typical bulk flows never have empty write queues...

Really, I do not want this optimization back, this is not worth it.

Again, a driver knows better if it is using IOMMU and if pathological
layouts can be optimized
to non SG ones, and using a pre-dma-map zone will also benefit pure
TCP ACK packets (which do not have any payload)

Here is the changelog of a patch I did for our GQ NIC (not yet
upstreamed, but will be soon)

...
   The problem is coming from gq_tx_clean() calling
    dma_unmap_single(q->dev, p->addr, -p->len, DMA_TO_DEVICE);

    This seems silly to perform possibly expensive IOMMU operations to
send small packets.
    (TCP pure acks are 86 bytes long in total for 99% of the cases)

    Idea of this patch is to pre-dma-map a memory zone to hold the
headers of the
    packet (if less than 128/256 bytes long)

    Then if the whole packet can be copied into this 128/256 bytes
zone, just copy it
    entirely.

    This permits to consume the small packets right away in ndo_start_xmit()
    while the skb (and associated socket sk_wmem_alloc) is hot, instead of later
    at TX completion time.
    This makes ACK packets cost much smaller, but also tiny TCP
packets (say, synthetic benchmarks)

    We enable this behavior only if IOMMU is used/forced on GQ,
    although we might use it regardless of IOMMU being used or not.
...
    To recap, there is a huge difference if we cross the 42 byte limit
: (for a 128 bytes zone per TX ring slot)

    iroa21:/home/edumazet# ./super_netperf 200 -H iroa23 -t TCP_RR -l
20 -- -r40,40
    2648141
    iroa21:/home/edumazet# ./super_netperf 200 -H iroa23 -t TCP_RR -l
20 -- -r44,44
     970691

    We might experiment with bigger GQ_TX_INLINE_HEADER_SIZE in the future ?
   ...