Neal Cardwell <ncardwell@xxxxxxxxxx> wrote: [ trimmed CCs, add Jozsef and nf-devel ] Neal, Eric, thanks for debugging this problem. > On Sat, Apr 2, 2022 at 12:32 PM Eric Dumazet <edumazet@xxxxxxxxxx> wrote: > > On Sat, Apr 2, 2022 at 9:29 AM Neal Cardwell <ncardwell@xxxxxxxxxx> wrote: > > > FWIW those log entries indicate netfilter on the mail client machine > > > dropping consecutive outbound skbs with 2*MSS of payload. So that > > > explains the large consecutive losses of client data packets to the > > > e-mail server. That seems to confirm my earlier hunch that those drops > > > of consecutive client data packets "do not look like normal congestive > > > packet loss". > > > > This also explains why we have all these tiny 2-MSS packets in the pcap. > > Under normal conditions, autocorking should kick in, allowing TCP to > > build bigger TSO packets. > > I have not looked at the conntrack code before today, but AFAICT this > is the buggy section of nf_conntrack_proto_tcp.c: > > } else if (((state->state == TCP_CONNTRACK_SYN_SENT > && dir == IP_CT_DIR_ORIGINAL) > || (state->state == TCP_CONNTRACK_SYN_RECV > && dir == IP_CT_DIR_REPLY)) > && after(end, sender->td_end)) { > /* > * RFC 793: "if a TCP is reinitialized ... then it need > * not wait at all; it must only be sure to use sequence > * numbers larger than those recently used." > */ > sender->td_end = > sender->td_maxend = end; > sender->td_maxwin = (win == 0 ? 1 : win); > > tcp_options(skb, dataoff, tcph, sender); > > Note that the tcp_options() function implicitly assumes it is being > called on a SYN, because it sets state->td_scale to 0 and only sets > state->td_scale to something non-zero if it sees a wscale option. So > if we ever call that on an skb that's not a SYN, we will forget that > the connection is using the wscale option. > > But at this point in the code it is calling tcp_options() without > first checking that this is a SYN. Yes, thats the bug, tcp_options() must not be called if syn bit is not set. > For this TFO scenario like the one in the trace, where the server > sends its first data packet after the SYNACK packet and before the > client's first ACK, presumably the conntrack state machine is > (correctly) SYN_RECV, and then (incorrectly) executes this code, Right. Jozsef, for context, sequence is in trace is: S > C Flags [S], seq 3451342529, win 62580, options [mss 8940,sackOK,TS val 331187616 ecr 0,nop,wscale 7,tfo [|tcp]> C > S Flags [S.], seq 2699962254, ack 3451342530, win 65535, options [mss 1440,sackOK,TS val 1206542770 ecr 331187616,nop,wscale 8], length 0 C > S Flags [P.], seq 1:89, ack 1, win 256, options [nop,nop,TS val 1206542772 ecr 331187616], length 88: SMTP [|smtp] Normally, 3rd packet would be S > C, but this one is C > S. So, packet #3 hits the 'reinit' branch which zaps wscale option. > Someone more familiar with conntrack may have a good idea about how to > best fix this? Jozsef, does this look sane to you? It fixes the TFO capture and still passes the test case i made for 82b72cb94666b3dbd7152bb9f441b068af7a921b ("netfilter: conntrack: re-init state for retransmitted syn-ack"). diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c index 8ec55cd72572..90ad1c0f23b1 100644 --- a/net/netfilter/nf_conntrack_proto_tcp.c +++ b/net/netfilter/nf_conntrack_proto_tcp.c @@ -556,33 +556,24 @@ static bool tcp_in_window(struct nf_conn *ct, } } - } else if (((state->state == TCP_CONNTRACK_SYN_SENT - && dir == IP_CT_DIR_ORIGINAL) - || (state->state == TCP_CONNTRACK_SYN_RECV - && dir == IP_CT_DIR_REPLY)) - && after(end, sender->td_end)) { + } else if (tcph->syn && + after(end, sender->td_end) && + (state->state == TCP_CONNTRACK_SYN_SENT || + state->state == TCP_CONNTRACK_SYN_RECV)) { /* * RFC 793: "if a TCP is reinitialized ... then it need * not wait at all; it must only be sure to use sequence * numbers larger than those recently used." - */ - sender->td_end = - sender->td_maxend = end; - sender->td_maxwin = (win == 0 ? 1 : win); - - tcp_options(skb, dataoff, tcph, sender); - } else if (tcph->syn && dir == IP_CT_DIR_REPLY && - state->state == TCP_CONNTRACK_SYN_SENT) { - /* Retransmitted syn-ack, or syn (simultaneous open). * + * also check for retransmitted syn-ack, or syn (simultaneous open). * Re-init state for this direction, just like for the first * syn(-ack) reply, it might differ in seq, ack or tcp options. + * + * Check for invalid syn-ack in original direction was already done. */ tcp_init_sender(sender, receiver, skb, dataoff, tcph, end, win); - if (!tcph->ack) - return true; } if (!(tcph->ack)) {