From: Eric Dumazet <edumazet@xxxxxxxxxx>

commit b617158dc096709d8600c53b6052144d12b89fab upstream.

Some applications set tiny SO_SNDBUF values and expect
TCP to just work. Recent patches to address CVE-2019-11478
broke them in case of losses, since retransmits might
be prevented.

We should allow these flows to make progress.

This patch allows the first and last skb in retransmit queue
to be split even if memory limits are hit.

It also adds some room due to the fact that tcp_sendmsg()
and tcp_sendpage() might overshoot sk_wmem_queued by about one full
TSO skb (64KB size). Note this allowance was already present
in stable backports for kernels < 4.15.

Note for < 4.15 backports :
 tcp_rtx_queue_tail() will probably look like :

static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
{
	struct sk_buff *skb = tcp_send_head(sk);

	return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
}

Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits")
Signed-off-by: Eric Dumazet <edumazet@xxxxxxxxxx>
Reported-by: Andrew Prout <aprout@xxxxxxxxxx>
Tested-by: Andrew Prout <aprout@xxxxxxxxxx>
Tested-by: Jonathan Lemon <jonathan.lemon@xxxxxxxxx>
Tested-by: Michal Kubecek <mkubecek@xxxxxxx>
Acked-by: Neal Cardwell <ncardwell@xxxxxxxxxx>
Acked-by: Yuchung Cheng <ycheng@xxxxxxxxxx>
Acked-by: Christoph Paasch <cpaasch@xxxxxxxxx>
Cc: Jonathan Looney <jtl@xxxxxxxxxxx>
Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx>
Signed-off-by: Matthieu Baerts <matthieu.baerts@xxxxxxxxxxxx>
---

Notes:
    Hello,

    Here is the backport for the linux-4.14.y branch, done simply by
    implementing the functions written by Eric in the commit message and in
    this email thread. It might be valid for older versions, I didn't check.

    In my setup with MPTCP, I had the same bug others had where TCP flows
    were stalled. The initial fix b6653b3629e5 ("tcp: refine memory limit
    test in tcp_fragment()") from Eric was helping, but the backport in
    < 4.15 was not looking safe: 1bc13903773b ("tcp: refine memory limit
    test in tcp_fragment()").

    I then decided to test the new fix and it is working fine in my test
    environment, no stalled TCP flows in a few hours.

    In this email thread I see that Eric will push a patch for v4.14. I
    absolutely do not want to add pressure or steal his work, but because I
    have the patch here and it is tested, I thought it could be a good idea
    to share it with others. Feel free to ignore this patch if needed. But
    if this patch can reduce Eric's workload a tiny bit, I would be very
    glad if it helps :)
    Because in the end it is Eric's work, feel free to change my
    "Signed-off-by" to "Tested-by" if that is how it works, or drop it if
    you prefer to wait for Eric's patch.

    Cheers,
    Matt
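For reference, the kind of application this change is protecting is one
that shrinks its send buffer before transferring data, along the lines of
the sketch below (the 4 KB value and the helper name are illustrative only,
not taken from the reports above):

#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Open a TCP socket with a deliberately tiny send buffer. The kernel
 * roughly doubles the requested value when setting sk_sndbuf; before this
 * patch such a flow could stall after losses, because tcp_fragment()
 * refused to split skbs once the memory limit was hit.
 */
static int make_small_sndbuf_socket(void)
{
	int sndbuf = 4096;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}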
 include/net/tcp.h     | 17 +++++++++++++++++
 net/ipv4/tcp_output.c | 11 ++++++++++-
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0b477a1e1177..7994e569644e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1688,6 +1688,23 @@ static inline void tcp_check_send_head(struct sock *sk, struct sk_buff *skb_unli
 		tcp_sk(sk)->highest_sack = NULL;
 }
 
+static inline struct sk_buff *tcp_rtx_queue_head(const struct sock *sk)
+{
+	struct sk_buff *skb = tcp_write_queue_head(sk);
+
+	if (skb == tcp_send_head(sk))
+		skb = NULL;
+
+	return skb;
+}
+
+static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
+{
+	struct sk_buff *skb = tcp_send_head(sk);
+
+	return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
+}
+
 static inline void __tcp_add_write_queue_tail(struct sock *sk, struct sk_buff *skb)
 {
 	__skb_queue_tail(&sk->sk_write_queue, skb);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a5960b9b6741..a99086bf26ea 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1264,6 +1264,7 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *buff;
 	int nsize, old_factor;
+	long limit;
 	int nlen;
 	u8 flags;
 
@@ -1274,7 +1275,15 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
 	if (nsize < 0)
 		nsize = 0;
 
-	if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf + 0x20000)) {
+	/* tcp_sendmsg() can overshoot sk_wmem_queued by one full size skb.
+	 * We need some allowance to not penalize applications setting small
+	 * SO_SNDBUF values.
+	 * Also allow first and last skb in retransmit queue to be split.
+	 */
+	limit = sk->sk_sndbuf + 2 * SKB_TRUESIZE(GSO_MAX_SIZE);
+	if (unlikely((sk->sk_wmem_queued >> 1) > limit &&
+		     skb != tcp_rtx_queue_head(sk) &&
+		     skb != tcp_rtx_queue_tail(sk))) {
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
 		return -ENOMEM;
 	}
-- 
2.20.1