Hi Eric, On 08/08/2019 21:05, Greg Kroah-Hartman wrote: > commit b617158dc096709d8600c53b6052144d12b89fab upstream. > > Some applications set tiny SO_SNDBUF values and expect > TCP to just work. Recent patches to address CVE-2019-11478 > broke them in case of losses, since retransmits might > be prevented. > > We should allow these flows to make progress. > > This patch allows the first and last skb in retransmit queue > to be split even if memory limits are hit. > > It also adds the some room due to the fact that tcp_sendmsg() > and tcp_sendpage() might overshoot sk_wmem_queued by about one full > TSO skb (64KB size). Note this allowance was already present > in stable backports for kernels < 4.15 > > Note for < 4.15 backports : > tcp_rtx_queue_tail() will probably look like : > > static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk) > { > struct sk_buff *skb = tcp_send_head(sk); > > return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk); > } > > Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits") > Signed-off-by: Eric Dumazet <edumazet@xxxxxxxxxx> > Reported-by: Andrew Prout <aprout@xxxxxxxxxx> > Tested-by: Andrew Prout <aprout@xxxxxxxxxx> > Tested-by: Jonathan Lemon <jonathan.lemon@xxxxxxxxx> > Tested-by: Michal Kubecek <mkubecek@xxxxxxx> > Acked-by: Neal Cardwell <ncardwell@xxxxxxxxxx> > Acked-by: Yuchung Cheng <ycheng@xxxxxxxxxx> > Acked-by: Christoph Paasch <cpaasch@xxxxxxxxx> > Cc: Jonathan Looney <jtl@xxxxxxxxxxx> > Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx> > Signed-off-by: Matthieu Baerts <matthieu.baerts@xxxxxxxxxxxx> > Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx> > --- > include/net/tcp.h | 17 +++++++++++++++++ > net/ipv4/tcp_output.c | 11 ++++++++++- > 2 files changed, 27 insertions(+), 1 deletion(-) I am sorry to bother you again with the recent modifications in tcp_fragment() but it seems we have a new kernel BUG with this patch in v4.14. Here is the call trace. [26665.934461] ------------[ cut here ]------------ [26665.936152] kernel BUG at ./include/linux/skbuff.h:1406! [26665.937941] invalid opcode: 0000 [#1] SMP PTI [26665.977252] Call Trace: [26665.978267] <IRQ> [26665.979163] tcp_fragment+0x9c/0x2cf [26665.980562] tcp_write_xmit+0x68f/0x988 [26665.982031] __tcp_push_pending_frames+0x3b/0xa0 [26665.983684] tcp_data_snd_check+0x2a/0xc8 [26665.985196] tcp_rcv_established+0x2a8/0x30d [26665.986736] tcp_v4_do_rcv+0xb2/0x158 [26665.988140] tcp_v4_rcv+0x692/0x956 [26665.989533] ip_local_deliver_finish+0xeb/0x169 [26665.991250] __netif_receive_skb_core+0x51c/0x582 [26665.993028] ? inet_gro_receive+0x239/0x247 [26665.994581] netif_receive_skb_internal+0xab/0xc6 [26665.996340] napi_gro_receive+0x8a/0xc0 [26665.997790] receive_buf+0x9a1/0x9cd [26665.999232] ? load_balance+0x17a/0x7b7 [26666.000711] ? vring_unmap_one+0x18/0x61 [26666.002196] ? detach_buf+0x60/0xfa [26666.003526] virtnet_poll+0x128/0x1e1 [26666.004860] net_rx_action+0x12a/0x2b1 [26666.006309] __do_softirq+0x11c/0x26b [26666.007734] ? handle_irq_event+0x44/0x56 [26666.009275] irq_exit+0x61/0xa0 [26666.010511] do_IRQ+0x9d/0xbb [26666.011685] common_interrupt+0x85/0x85 We are doing the tests with the MPTCP stack[1], the error might come from there but the call trace is free of MPTCP functions. We are still working on having a reproducible setup with MPTCP before doing the same without MPTCP but please see below the analysis we did so far with some questions. [1] https://github.com/multipath-tcp/mptcp/tree/mptcp_v0.94 > diff --git a/include/net/tcp.h b/include/net/tcp.h > index 0b477a1e11770..7994e569644e0 100644 > --- a/include/net/tcp.h > +++ b/include/net/tcp.h > @@ -1688,6 +1688,23 @@ static inline void tcp_check_send_head(struct sock *sk, struct sk_buff *skb_unli > tcp_sk(sk)->highest_sack = NULL; > } > > +static inline struct sk_buff *tcp_rtx_queue_head(const struct sock *sk) > +{ > + struct sk_buff *skb = tcp_write_queue_head(sk); > + > + if (skb == tcp_send_head(sk)) > + skb = NULL; > + > + return skb; > +} > + > +static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk) > +{ > + struct sk_buff *skb = tcp_send_head(sk); > + > + return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk); > +} > + > static inline void __tcp_add_write_queue_tail(struct sock *sk, struct sk_buff *skb) > { > __skb_queue_tail(&sk->sk_write_queue, skb); > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index a5960b9b6741c..a99086bf26eaf 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -1264,6 +1264,7 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len, > struct tcp_sock *tp = tcp_sk(sk); > struct sk_buff *buff; > int nsize, old_factor; > + long limit; > int nlen; > u8 flags; > > @@ -1274,7 +1275,15 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len, > if (nsize < 0) > nsize = 0; > > - if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf + 0x20000)) { > + /* tcp_sendmsg() can overshoot sk_wmem_queued by one full size skb. > + * We need some allowance to not penalize applications setting small > + * SO_SNDBUF values. > + * Also allow first and last skb in retransmit queue to be split. > + */ > + limit = sk->sk_sndbuf + 2 * SKB_TRUESIZE(GSO_MAX_SIZE); > + if (unlikely((sk->sk_wmem_queued >> 1) > limit && > + skb != tcp_rtx_queue_head(sk) && If the skb returned by tcp_rtx_queue_head() is null -- because skb == tcp_send_head(sk), please see above -- should we not do something else than going to the next condition? > + skb != tcp_rtx_queue_tail(sk))) { It seems that behind, tcp_write_queue_prev() can be called -- please see above -- which will directly called skb_queue_prev() which does this: /* This BUG_ON may seem severe, but if we just return then we * are going to dereference garbage. */ BUG_ON(skb_queue_is_first(list, skb)); return skb->prev; Should we do the same check before to avoid the BUG_ON()? Could it be normal to hit this BUG_ON() with regular TCP or is it due to other changes in MPTCP stack? We hope not having bothered you for something not in the upstream kernel (yet) :) Cheers, Matt > NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG); > return -ENOMEM; > } > -- Matthieu Baerts | R&D Engineer matthieu.baerts@xxxxxxxxxxxx Tessares SA | Hybrid Access Solutions www.tessares.net 1 Avenue Jean Monnet, 1348 Louvain-la-Neuve, Belgium