On Wed, Sep 7, 2022 at 5:26 AM Alexandra Winter <wintera@xxxxxxxxxxxxx> wrote:
>
> Since linear payload was removed even for single small messages,
> an additional page is required and we are measuring the performance impact.
>
> 3613b3dbd1ad ("tcp: prepare skbs for better sack shifting")
> explicitly allowed "payload in skb->head for first skb put in the queue,
> to not impact RPC workloads."
> 472c2e07eef0 ("tcp: add one skb cache for tx")
> made that obsolete and removed it.
> When d8b81175e412 ("tcp: remove sk_{tr}x_skb_cache")
> reverted that change, this piece was not added back in.
>
> When running uperf with a request-response pattern with 1k payload
> and 250 parallel connections, we measure a 13% difference in throughput
> for our PCI-based network interfaces since 472c2e07eef0.
> (Our IOMMU is sensitive to the number of mapped pages.)
>
> Could you please consider allowing linear payload for the first
> skb in the queue again? A patch proposal is appended below.

No. Please add a workaround in your driver.

You can increase throughput by 20% by premapping a coherent piece of
memory into which you can copy small skbs (skb->head included),
something like 256 bytes per slot in the TX ring. A rough sketch of
that idea is appended at the end of this mail.

>
> Kind regards
> Alexandra
>
> ---------------------------------------------------------------
>
> tcp: allow linear skb payload for first in queue
>
> Allow payload in skb->head for the first skb in the queue;
> RPC workloads will benefit.
>
> Fixes: 472c2e07eef0 ("tcp: add one skb cache for tx")
> Signed-off-by: Alexandra Winter <wintera@xxxxxxxxxxxxx>
> ---
>  net/ipv4/tcp.c | 39 +++++++++++++++++++++++++++++++++++++--
>  1 file changed, 37 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index e5011c136fdb..f7cbccd41d85 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1154,6 +1154,30 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
>  }
>  EXPORT_SYMBOL(tcp_sendpage);
>
> +/* Do not bother using a page frag for very small frames.
> + * But use this heuristic only for the first skb in write queue.
> + *
> + * Having no payload in skb->head allows better SACK shifting
> + * in tcp_shift_skb_data(), reducing sack/rack overhead, because
> + * write queue has less skbs.
> + * Each skb can hold up to MAX_SKB_FRAGS * 32Kbytes, or ~0.5 MB.
> + * This also speeds up tso_fragment(), since it won't fallback
> + * to tcp_fragment().
> + */
> +static int linear_payload_sz(bool first_skb)
> +{
> +        if (first_skb)
> +                return SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER);
> +        return 0;
> +}
> +
> +static int select_size(bool first_skb, bool zc)
> +{
> +        if (zc)
> +                return 0;
> +        return linear_payload_sz(first_skb);
> +}
> +
>  void tcp_free_fastopen_req(struct tcp_sock *tp)
>  {
>          if (tp->fastopen_req) {
> @@ -1311,6 +1335,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>
>                  if (copy <= 0 || !tcp_skb_can_collapse_to(skb)) {
>                          bool first_skb;
> +                        int linear;
>
>  new_segment:
>                          if (!sk_stream_memory_free(sk))
> @@ -1322,7 +1347,9 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>                                  goto restart;
>                          }
>                          first_skb = tcp_rtx_and_write_queues_empty(sk);
> -                        skb = tcp_stream_alloc_skb(sk, 0, sk->sk_allocation,
> +                        linear = select_size(first_skb, zc);
> +                        skb = tcp_stream_alloc_skb(sk, linear,
> +                                                   sk->sk_allocation,
>                                                     first_skb);
>                          if (!skb)
>                                  goto wait_for_space;
> @@ -1344,7 +1371,15 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>                          if (copy > msg_data_left(msg))
>                                  copy = msg_data_left(msg);
>
> -                        if (!zc) {
> +                        /* Where to copy to? */
> +                        if (skb_availroom(skb) > 0 && !zc) {
> +                                /* We have some space in skb head. Superb! */
> +                                copy = min_t(int, copy, skb_availroom(skb));
> +                                err = skb_add_data_nocache(sk, skb, &msg->msg_iter,
> +                                                           copy);
> +                                if (err)
> +                                        goto do_error;
> +                        } else if (!zc) {
>                                  bool merge = true;
>                                  int i = skb_shinfo(skb)->nr_frags;
>                                  struct page_frag *pfrag = sk_page_frag(sk);
> --
> 2.24.3 (Apple Git-128)
>
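Rough, untested sketch of the TX copybreak idea mentioned above. It is not
tied to any real driver: foo_tx_ring, foo_alloc_tx_bounce(),
foo_tx_map_or_copy() and FOO_TX_COPYBREAK are made-up names, and a real
implementation has to fit the hardware's descriptor format and completion
path. The point is to allocate one coherent region at ring-setup time, with
a small slot per TX descriptor, and to copy small linear skbs into their
slot instead of DMA-mapping them, so they never create a per-packet IOMMU
mapping.

#include <linux/dma-mapping.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

#define FOO_TX_COPYBREAK        256     /* bytes copied instead of mapped */

struct foo_tx_ring {
        struct device *dev;
        unsigned int size;              /* number of TX descriptors */
        void *bounce;                   /* coherent area, size * FOO_TX_COPYBREAK */
        dma_addr_t bounce_dma;
};

/* Called once at ring setup: one coherent slot per descriptor. */
static int foo_alloc_tx_bounce(struct foo_tx_ring *ring)
{
        ring->bounce = dma_alloc_coherent(ring->dev,
                                          ring->size * FOO_TX_COPYBREAK,
                                          &ring->bounce_dma, GFP_KERNEL);
        return ring->bounce ? 0 : -ENOMEM;
}

/* Called at transmit time: return the DMA address to put in the descriptor
 * for this slot (dma_mapping_error() handling omitted for brevity).
 */
static dma_addr_t foo_tx_map_or_copy(struct foo_tx_ring *ring,
                                     unsigned int slot, struct sk_buff *skb)
{
        if (!skb_is_nonlinear(skb) && skb->len <= FOO_TX_COPYBREAK) {
                void *dst = ring->bounce + slot * FOO_TX_COPYBREAK;

                /* Small frame: copy headers and payload out of skb->head.
                 * No mapping, so no per-packet IOMMU work.
                 */
                skb_copy_from_linear_data(skb, dst, skb->len);
                return ring->bounce_dma + slot * FOO_TX_COPYBREAK;
        }

        /* Larger or fragmented skbs still go through the usual mapping. */
        return dma_map_single(ring->dev, skb->data, skb_headlen(skb),
                              DMA_TO_DEVICE);
}

The completion path then has to remember, per slot, whether the frame was
copied or mapped, so it only calls dma_unmap_single() in the mapped case.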