On Tue, Aug 6, 2019 at 5:09 PM Matthieu Baerts <matthieu.baerts@xxxxxxxxxxxx> wrote:
>
> From: Eric Dumazet <edumazet@xxxxxxxxxx>
>
> commit b617158dc096709d8600c53b6052144d12b89fab upstream.
>
> Some applications set tiny SO_SNDBUF values and expect
> TCP to just work. Recent patches to address CVE-2019-11478
> broke them in case of losses, since retransmits might
> be prevented.
>
> We should allow these flows to make progress.
>
> This patch allows the first and last skb in the retransmit queue
> to be split even if memory limits are hit.
>
> It also adds some room due to the fact that tcp_sendmsg()
> and tcp_sendpage() might overshoot sk_wmem_queued by about one full
> TSO skb (64KB size). Note this allowance was already present
> in stable backports for kernels < 4.15.
>
> Note for < 4.15 backports:
> tcp_rtx_queue_tail() will probably look like:
>
> static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
> {
>         struct sk_buff *skb = tcp_send_head(sk);
>
>         return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
> }
>
> Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits")
> Signed-off-by: Eric Dumazet <edumazet@xxxxxxxxxx>
> Reported-by: Andrew Prout <aprout@xxxxxxxxxx>
> Tested-by: Andrew Prout <aprout@xxxxxxxxxx>
> Tested-by: Jonathan Lemon <jonathan.lemon@xxxxxxxxx>
> Tested-by: Michal Kubecek <mkubecek@xxxxxxx>
> Acked-by: Neal Cardwell <ncardwell@xxxxxxxxxx>
> Acked-by: Yuchung Cheng <ycheng@xxxxxxxxxx>
> Acked-by: Christoph Paasch <cpaasch@xxxxxxxxx>
> Cc: Jonathan Looney <jtl@xxxxxxxxxxx>
> Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx>
> Signed-off-by: Matthieu Baerts <matthieu.baerts@xxxxxxxxxxxx>
> ---
>
> Notes:
>     Hello,
>
>     Here is the backport for the linux-4.14.y branch, done simply by
>     implementing the functions written by Eric in the commit message
>     and in this email thread. It might be valid for older versions as
>     well; I did not check.
>
>     In my setup with MPTCP, I had the same bug others had, where TCP
>     flows were stalled. The initial fix b6653b3629e5 ("tcp: refine
>     memory limit test in tcp_fragment()") from Eric was helping, but
>     the backport in < 4.15 was not looking safe: 1bc13903773b ("tcp:
>     refine memory limit test in tcp_fragment()").
>
>     I then decided to test the new fix and it is working fine in my
>     test environment: no stalled TCP flows in a few hours.
>
>     In this email thread I see that Eric will push a patch for v4.14.
>     I absolutely do not want to add pressure or steal his work, but
>     because I have the patch here and it is tested, I was thinking it
>     could be a good idea to share it with others. Feel free to ignore
>     this patch if needed. But if it can reduce Eric's workload a tiny
>     bit, I would be very glad if it helps :)
>
>     Because in the end it is Eric's work, feel free to change my
>     "Signed-off-by" to "Tested-by" if that is how it works, or drop it
>     entirely if you prefer to wait for Eric's patch.

This patch is fine, I was simply on vacation last week, and wanted to
truly take full advantage of it ;)

Thanks !
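One more note on the arithmetic, since the bound changed shape:
2 * SKB_TRUESIZE(GSO_MAX_SIZE) works out to roughly the old hard-coded
0x20000, so the real relief for tiny-SO_SNDBUF flows is the head/tail
exemption. If anyone wants to play with the numbers, here is a rough
userspace sketch; the SKB_TRUESIZE() overhead below is an approximation,
not the exact kernel value, and the sndbuf/wmem figures are made up:

    /* Rough userspace sketch of the old vs new tcp_fragment() bound.
     * SKB_OVERHEAD_APPROX stands in for the real sk_buff + shared_info
     * truesize overhead; all numbers are illustrative only.
     */
    #include <stdio.h>

    #define GSO_MAX_SIZE        65536
    #define SKB_OVERHEAD_APPROX 512
    #define SKB_TRUESIZE(x)     ((x) + SKB_OVERHEAD_APPROX)

    int main(void)
    {
            long sk_sndbuf = 4096;              /* tiny SO_SNDBUF set by the app */
            long sk_wmem_queued = 512 * 1024;   /* queued bytes after losses */

            long old_limit = sk_sndbuf + 0x20000;   /* old hard-coded allowance */
            long new_limit = sk_sndbuf + 2 * SKB_TRUESIZE(GSO_MAX_SIZE);

            printf("old check: refuse split = %d\n",
                   (sk_wmem_queued >> 1) > old_limit);
            printf("new check: refuse split = %d, and only when skb is neither\n"
                   "           rtx queue head nor tail\n",
                   (sk_wmem_queued >> 1) > new_limit);
            return 0;
    }

With those made-up numbers both the old and new limits are exceeded, yet
the new code still lets the first and last skb in the retransmit queue be
split, which is what keeps small-SO_SNDBUF flows moving after a loss.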
>
> Cheers,
> Matt
>
>  include/net/tcp.h     | 17 +++++++++++++++++
>  net/ipv4/tcp_output.c | 11 ++++++++++-
>  2 files changed, 27 insertions(+), 1 deletion(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 0b477a1e1177..7994e569644e 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1688,6 +1688,23 @@ static inline void tcp_check_send_head(struct sock *sk, struct sk_buff *skb_unli
>                 tcp_sk(sk)->highest_sack = NULL;
>  }
>
> +static inline struct sk_buff *tcp_rtx_queue_head(const struct sock *sk)
> +{
> +       struct sk_buff *skb = tcp_write_queue_head(sk);
> +
> +       if (skb == tcp_send_head(sk))
> +               skb = NULL;
> +
> +       return skb;
> +}
> +
> +static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
> +{
> +       struct sk_buff *skb = tcp_send_head(sk);
> +
> +       return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
> +}
> +
>  static inline void __tcp_add_write_queue_tail(struct sock *sk, struct sk_buff *skb)
>  {
>         __skb_queue_tail(&sk->sk_write_queue, skb);
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index a5960b9b6741..a99086bf26ea 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1264,6 +1264,7 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
>         struct tcp_sock *tp = tcp_sk(sk);
>         struct sk_buff *buff;
>         int nsize, old_factor;
> +       long limit;
>         int nlen;
>         u8 flags;
>
> @@ -1274,7 +1275,15 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
>         if (nsize < 0)
>                 nsize = 0;
>
> -       if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf + 0x20000)) {
> +       /* tcp_sendmsg() can overshoot sk_wmem_queued by one full size skb.
> +        * We need some allowance to not penalize applications setting small
> +        * SO_SNDBUF values.
> +        * Also allow first and last skb in retransmit queue to be split.
> +        */
> +       limit = sk->sk_sndbuf + 2 * SKB_TRUESIZE(GSO_MAX_SIZE);
> +       if (unlikely((sk->sk_wmem_queued >> 1) > limit &&
> +                    skb != tcp_rtx_queue_head(sk) &&
> +                    skb != tcp_rtx_queue_tail(sk))) {
>                 NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
>                 return -ENOMEM;
>         }
> --
> 2.20.1
>
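For reviewers comparing against upstream: if I remember the >= 4.15 tree
correctly, these helpers live on the rb-tree retransmit queue there,
roughly as below (please double-check against include/net/tcp.h before
trusting my memory), which is why the 4.14 backport has to rebuild them
on top of the single write queue and tcp_send_head():

    /* Upstream (>= 4.15) definitions, quoted from memory -- verify before use. */
    static inline struct sk_buff *tcp_rtx_queue_head(const struct sock *sk)
    {
            return skb_rb_first(&sk->tcp_rtx_queue);
    }

    static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
    {
            return skb_rb_last(&sk->tcp_rtx_queue);
    }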