Re: [PATCH AUTOSEL 4.19 04/42] netfilter: conntrack: always store window size un-scaled

Reindl Harald <h.reindl@xxxxxxxxxxxxx> · Wed, 14 Aug 2019 12:19:06 +0200



that's still not in 5.2.8

without the exception and "nf_conntrack_tcp_timeout_max_retrans = 60" a
vnc-over-ssh session having the VNC view in the background freezes
within 60 seconds

-----------------------------------------------------------------------------------------------
IPV4 TABLE MANGLE (STATEFUL PRE-NAT/FILTER)
-----------------------------------------------------------------------------------------------
Chain PREROUTING (policy ACCEPT 100 packets, 9437 bytes)
num   pkts bytes target     prot opt in     out     source               destination
1     6526 3892K ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            ctstate RELATED,ESTABLISHED
2      125  6264 ACCEPT     all  --  lo     *       0.0.0.0/0            0.0.0.0/0
3       64  4952 ACCEPT     all  --  vmnet8 *       0.0.0.0/0            0.0.0.0/0
4        1    40 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            ctstate INVALID

-------- Forwarded Message --------
Subject: [PATCH AUTOSEL 5.2 07/76] netfilter: conntrack: always store
window size un-scaled

On 08.08.19 at 11:02, Thomas Jarosch wrote:
> Hello together,
> 
> You wrote on Fri, Aug 02, 2019 at 09:22:24AM -0400:
>> From: Florian Westphal <fw@xxxxxxxxx>
>>
>> [ Upstream commit 959b69ef57db00cb33e9c4777400ae7183ebddd3 ]
>>
>> Jakub Jankowski reported following oddity:
>>
>> After 3 way handshake completes, timeout of new connection is set to
>> max_retrans (300s) instead of established (5 days).
>>
>> shortened excerpt from pcap provided:
>> 25.070622 IP (flags [DF], proto TCP (6), length 52)
>> 10.8.5.4.1025 > 10.8.1.2.80: Flags [S], seq 11, win 64240, [wscale 8]
>> 26.070462 IP (flags [DF], proto TCP (6), length 48)
>> 10.8.1.2.80 > 10.8.5.4.1025: Flags [S.], seq 82, ack 12, win 65535, [wscale 3]
>> 27.070449 IP (flags [DF], proto TCP (6), length 40)
>> 10.8.5.4.1025 > 10.8.1.2.80: Flags [.], ack 83, win 512, length 0
>>
>> Turns out the last_win is of u16 type, but we store the scaled value:
>> 512 << 8 (== 0x20000) becomes 0 window.
>>
>> The Fixes tag is not correct, as the bug has existed forever, but
>> without that change all that this causes might cause is to mistake a
>> window update (to-nonzero-from-zero) for a retransmit.
>>
>> Fixes: fbcd253d2448b8 ("netfilter: conntrack: lower timeout to RETRANS seconds if window is 0")
>> Reported-by: Jakub Jankowski <shasta@xxxxxxxxxxx>
>> Tested-by: Jakub Jankowski <shasta@xxxxxxxxxxx>
>> Signed-off-by: Florian Westphal <fw@xxxxxxxxx>
>> Acked-by: Jozsef Kadlecsik <kadlec@xxxxxxxxxxxxxxxxx>
>> Signed-off-by: Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx>
>> Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>
> 
> Also:
> Tested-by: Thomas Jarosch <thomas.jarosch@xxxxxxxxxxxxx>
> 
> ;)
> 
> We've hit the issue with the wrong conntrack timeout at two different sites,
> long-lived connections to a SAP server over IPSec VPN were constantly dropping.
> 
> For us this was a regression after updating from kernel 3.14 to 4.19.
> Yesterday I've applied the patch to kernel 4.19.57 and the problem is fixed.
> 
> The issue was extra hard to debug as we could just boot the new kernel
> for twenty minutes in the evening on these productive systems.
> 
> The stable kernel patch from last Friday came right on time. I was just
> about to replay the TCP connection with tcpreplay, so this saved
> me from another week of debugging. Thanks everyone!
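
The truncation described in the quoted commit message (512 << 8 stored in a
u16) can be reproduced in a few lines of userspace C. This is only a sketch
of the arithmetic, not the conntrack code: store_scaled(), scale_on_read()
and check_example() are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: scale first, then store in a 16-bit field.
 * Anything at or above 1 << 16 loses its upper bits. */
static uint16_t store_scaled(uint16_t raw_win, uint8_t wscale)
{
    uint32_t scaled = (uint32_t)raw_win << wscale;

    return (uint16_t)scaled; /* keeps only the low 16 bits */
}

/* Hypothetical helper: keep the raw 16-bit window as stored and
 * apply the scale factor only when the value is read. */
static uint32_t scale_on_read(uint16_t stored_raw, uint8_t wscale)
{
    return (uint32_t)stored_raw << wscale;
}

/* Values from the pcap excerpt: win 512, wscale 8. */
static void check_example(void)
{
    assert(store_scaled(512, 8) == 0);        /* 0x20000 truncated to 0 */
    assert(scale_on_read(512, 8) == 0x20000); /* 131072, as intended */
}
```

With the values from the pcap, the scaled-then-stored variant yields a zero
window, which is the condition the commit message describes as lowering the
timeout to max_retrans.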