Re: [PATCH 4.14 04/33] tcp: be more careful in tcp_fragment()

Matthieu Baerts <matthieu.baerts@xxxxxxxxxxxx> · Tue, 20 Aug 2019 18:45:38 +0200

Hi Eric,

On 08/08/2019 21:05, Greg Kroah-Hartman wrote:
> commit b617158dc096709d8600c53b6052144d12b89fab upstream.
> 
> Some applications set tiny SO_SNDBUF values and expect
> TCP to just work. Recent patches to address CVE-2019-11478
> broke them in case of losses, since retransmits might
> be prevented.
> 
> We should allow these flows to make progress.
> 
> This patch allows the first and last skb in retransmit queue
> to be split even if memory limits are hit.
> 
> It also adds the some room due to the fact that tcp_sendmsg()
> and tcp_sendpage() might overshoot sk_wmem_queued by about one full
> TSO skb (64KB size). Note this allowance was already present
> in stable backports for kernels < 4.15
> 
> Note for < 4.15 backports :
>  tcp_rtx_queue_tail() will probably look like :
> 
> static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
> {
> 	struct sk_buff *skb = tcp_send_head(sk);
> 
> 	return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
> }
> 
> Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits")
> Signed-off-by: Eric Dumazet <edumazet@xxxxxxxxxx>
> Reported-by: Andrew Prout <aprout@xxxxxxxxxx>
> Tested-by: Andrew Prout <aprout@xxxxxxxxxx>
> Tested-by: Jonathan Lemon <jonathan.lemon@xxxxxxxxx>
> Tested-by: Michal Kubecek <mkubecek@xxxxxxx>
> Acked-by: Neal Cardwell <ncardwell@xxxxxxxxxx>
> Acked-by: Yuchung Cheng <ycheng@xxxxxxxxxx>
> Acked-by: Christoph Paasch <cpaasch@xxxxxxxxx>
> Cc: Jonathan Looney <jtl@xxxxxxxxxxx>
> Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx>
> Signed-off-by: Matthieu Baerts <matthieu.baerts@xxxxxxxxxxxx>
> Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>
> ---
>  include/net/tcp.h     | 17 +++++++++++++++++
>  net/ipv4/tcp_output.c | 11 ++++++++++-
>  2 files changed, 27 insertions(+), 1 deletion(-)

I am sorry to bother you again with the recent modifications in
tcp_fragment() but it seems we have a new kernel BUG with this patch in
v4.14.

Here is the call trace.

[26665.934461] ------------[ cut here ]------------
[26665.936152] kernel BUG at ./include/linux/skbuff.h:1406!
[26665.937941] invalid opcode: 0000 [#1] SMP PTI
[26665.977252] Call Trace:
[26665.978267]  <IRQ>
[26665.979163]  tcp_fragment+0x9c/0x2cf
[26665.980562]  tcp_write_xmit+0x68f/0x988
[26665.982031]  __tcp_push_pending_frames+0x3b/0xa0
[26665.983684]  tcp_data_snd_check+0x2a/0xc8
[26665.985196]  tcp_rcv_established+0x2a8/0x30d
[26665.986736]  tcp_v4_do_rcv+0xb2/0x158
[26665.988140]  tcp_v4_rcv+0x692/0x956
[26665.989533]  ip_local_deliver_finish+0xeb/0x169
[26665.991250]  __netif_receive_skb_core+0x51c/0x582
[26665.993028]  ? inet_gro_receive+0x239/0x247
[26665.994581]  netif_receive_skb_internal+0xab/0xc6
[26665.996340]  napi_gro_receive+0x8a/0xc0
[26665.997790]  receive_buf+0x9a1/0x9cd
[26665.999232]  ? load_balance+0x17a/0x7b7
[26666.000711]  ? vring_unmap_one+0x18/0x61
[26666.002196]  ? detach_buf+0x60/0xfa
[26666.003526]  virtnet_poll+0x128/0x1e1
[26666.004860]  net_rx_action+0x12a/0x2b1
[26666.006309]  __do_softirq+0x11c/0x26b
[26666.007734]  ? handle_irq_event+0x44/0x56
[26666.009275]  irq_exit+0x61/0xa0
[26666.010511]  do_IRQ+0x9d/0xbb
[26666.011685]  common_interrupt+0x85/0x85

We are doing the tests with the MPTCP stack[1], the error might come
from there but the call trace is free of MPTCP functions. We are still
working on having a reproducible setup with MPTCP before doing the same
without MPTCP but please see below the analysis we did so far with some
questions.

[1] https://github.com/multipath-tcp/mptcp/tree/mptcp_v0.94

> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 0b477a1e11770..7994e569644e0 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1688,6 +1688,23 @@ static inline void tcp_check_send_head(struct sock *sk, struct sk_buff *skb_unli
>  		tcp_sk(sk)->highest_sack = NULL;
>  }
>  
> +static inline struct sk_buff *tcp_rtx_queue_head(const struct sock *sk)
> +{
> +	struct sk_buff *skb = tcp_write_queue_head(sk);
> +
> +	if (skb == tcp_send_head(sk))
> +		skb = NULL;
> +
> +	return skb;
> +}
> +
> +static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
> +{
> +	struct sk_buff *skb = tcp_send_head(sk);
> +
> +	return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
> +}
> +
>  static inline void __tcp_add_write_queue_tail(struct sock *sk, struct sk_buff *skb)
>  {
>  	__skb_queue_tail(&sk->sk_write_queue, skb);
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index a5960b9b6741c..a99086bf26eaf 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1264,6 +1264,7 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
>  	struct tcp_sock *tp = tcp_sk(sk);
>  	struct sk_buff *buff;
>  	int nsize, old_factor;
> +	long limit;
>  	int nlen;
>  	u8 flags;
>  
> @@ -1274,7 +1275,15 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
>  	if (nsize < 0)
>  		nsize = 0;
>  
> -	if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf + 0x20000)) {
> +	/* tcp_sendmsg() can overshoot sk_wmem_queued by one full size skb.
> +	 * We need some allowance to not penalize applications setting small
> +	 * SO_SNDBUF values.
> +	 * Also allow first and last skb in retransmit queue to be split.
> +	 */
> +	limit = sk->sk_sndbuf + 2 * SKB_TRUESIZE(GSO_MAX_SIZE);
> +	if (unlikely((sk->sk_wmem_queued >> 1) > limit &&
> +		     skb != tcp_rtx_queue_head(sk) &&

If the skb returned by tcp_rtx_queue_head() is null -- because skb ==
tcp_send_head(sk), please see above -- should we not do something else
than going to the next condition?

> +		     skb != tcp_rtx_queue_tail(sk))) {

It seems that behind, tcp_write_queue_prev() can be called -- please see
above -- which will directly called skb_queue_prev() which does this:

	/* This BUG_ON may seem severe, but if we just return then we
	 * are going to dereference garbage.
	 */
	BUG_ON(skb_queue_is_first(list, skb));
	return skb->prev;

Should we do the same check before to avoid the BUG_ON()?

Could it be normal to hit this BUG_ON() with regular TCP or is it due to
other changes in MPTCP stack? We hope not having bothered you for
something not in the upstream kernel (yet) :)

Cheers,
Matt

>  		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
>  		return -ENOMEM;
>  	}
> 

-- 
Matthieu Baerts | R&D Engineer
matthieu.baerts@xxxxxxxxxxxx
Tessares SA | Hybrid Access Solutions
www.tessares.net
1 Avenue Jean Monnet, 1348 Louvain-la-Neuve, Belgium