On Thu, Jan 30, 2020 at 4:41 AM <sjpark@xxxxxxxxxx> wrote:
>
> On Wed, 29 Jan 2020 09:52:43 -0800 Eric Dumazet <edumazet@xxxxxxxxxx> wrote:
>
> > On Wed, Jan 29, 2020 at 9:14 AM <sjpark@xxxxxxxxxx> wrote:
> > >
> > > Hello,
> > >
> > >
> > > We found races in the kernel code that incur latency spikes. We would
> > > thus like to share our investigation and hear your opinions.
> > >
> > >
> > > Problem Reproduction
> > > ====================
> > >
> > > You can reproduce the problem by compiling and running the source code
> > > of 'server.c' and 'client.c', which I pasted at the end of this mail,
> > > as below:
> > >
> > >     $ gcc -o client client.c
> > >     $ gcc -o server server.c
> > >     $ ./server &
> > >     $ ./client
> > >     ...
> > >     port: 45150, lat: 1005320, avg: 229, nr: 1070811
> > >     ...
> > >
> >
> > Thanks for the repro !
>
> My pleasure :)
>
> > [...]
> > > Experimental Fix
> > > ----------------
> > >
> > > We confirmed this is the case by logging and some experiments. Further,
> > > because the processing of the RST/ACK packet would in most cases get
> > > stuck in front of the critical section while the ACK is being processed
> > > inside the critical section, we added one more check for the RST/ACK
> > > inside the critical section. In detail, it is as below:
> > >
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -1912,6 +1912,29 @@ int tcp_v4_rcv(struct sk_buff *skb)
> > >  	tcp_segs_in(tcp_sk(sk), skb);
> > >  	ret = 0;
> > >  	if (!sock_owned_by_user(sk)) {
> > > +		// While waiting for the socket lock, the sk may have
> > > +		// transitioned to FIN_WAIT2/TIME_WAIT so lookup the
> > > +		// twsk and if one is found reprocess the skb
> > > +		if (unlikely(sk->sk_state == TCP_CLOSE && !th->syn
> > > +			     && (th->fin || th->rst))) {
> > > +			struct sock *sk2 = __inet_lookup_established(
> > > +						net, &tcp_hashinfo,
> > > +						iph->saddr, th->source,
> > > +						iph->daddr, ntohs(th->dest),
> > > +						inet_iif(skb), sdif);
> > > +			if (sk2) {
> > > +				if (sk2 == sk) {
> > > +					sock_put(sk2);
> > > +				} else {
> > > +					bh_unlock_sock(sk);
> > > +					tcp_v4_restore_cb(skb);
> > > +					if (refcounted) sock_put(sk);
> > > +					sk = sk2;
> > > +					refcounted = true;
> > > +					goto process;
> > > +				}
> > > +			}
> > > +		}
> >
> > Here are my comments:
> >
> > 1) This fixes the IPv4 side only, so it cannot be a proper fix.
> >
> > 2) TCP is best effort. You can retry the lookup in the ehash tables as
> > many times as you want; a race can always happen after your last lookup.
> >
> > Normal TCP flows going through a real NIC won't hit this race, since all
> > packets for a given 4-tuple are handled by one cpu (RSS affinity).
> >
> > Basically, the race here is that two packets for the same flow are
> > handled by two cpus. Who wins the race is random; we cannot enforce a
> > particular order.
>
> Thank you for the comments! I personally agree with your opinions.
>
> > I would rather try to fix the issue more generically, without adding
> > extra lookups as you did, since they might appear to reduce the race,
> > but not completely fix it.
> >
> > For example, the fact that the client side ignores the RST and
> > retransmits a SYN after one second might be something that should be
> > fixed.
>
> I also agree with this direction. It seems detecting this situation and
> adjusting the return value of tcp_timeout_init() to a value much lower
> than the default one second would be a straightforward solution. For a
> test, I modified the function to return 1 (4ms with CONFIG_HZ=250) and
> confirmed that the reproducer became silent.
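>
> In concrete terms, the test change was roughly the below. This is a sketch
> rather than the exact patch I ran; tcp_timeout_init() is the static inline
> in include/net/tcp.h, whose default, TCP_TIMEOUT_INIT, is 1*HZ (hunk
> offsets omitted):
>
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ ... @@ static inline u32 tcp_timeout_init(struct sock *sk)
>  	timeout = tcp_call_bpf(sk, BPF_SOCK_OPS_TIMEOUT_INIT, 0, NULL);
>
>  	if (timeout <= 0)
> -		timeout = TCP_TIMEOUT_INIT;
> +		timeout = 1;	/* one jiffy: 4ms with CONFIG_HZ=250 */
>  	return timeout;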
> My following question is: how can we detect this situation in the kernel?
> However, I'm unsure how we can distinguish this specific case from other
> cases, as everything is working as normal according to the TCP protocol.
>
> Also, it seems the value is made adjustable from user space using the bpf
> callback, BPF_SOCK_OPS_TIMEOUT_INIT:
>
>	BPF_SOCK_OPS_TIMEOUT_INIT,	/* Should return SYN-RTO value to use or
>					 * -1 if default value should be used
>					 */
>
> Thus, it sounds like you are suggesting to do the detection and adjustment
> from user space. Am I understanding your point correctly? If not, please
> let me know.
>

No, I was suggesting to implement a mitigation in the kernel: when in the
SYN_SENT state, receiving a suspicious ACK should not simply trigger a RST.

There are multiple possible ways to address the issue:

1) Abort the SYN_SENT state and let user space receive an error for its
connect() immediately.

2) Instead of sending a RST, allow the first SYN retransmit to happen
immediately. (This is a kind of challenge SYN; the kernel already
implements challenge ACKs.)

3) After the RST is sent (to hopefully clear the state of the remote peer),
schedule a SYN retransmit in a few ms, instead of after ~one second.

> >
> > 11:57:14.436259 IP 127.0.0.1.45150 > 127.0.0.1.4242: Flags [S], seq
> > 2560603644, win 65495, options [mss 65495,sackOK,TS val 953760623 ecr
> > 0,nop,wscale 7], length 0
> > 11:57:14.436266 IP 127.0.0.1.4242 > 127.0.0.1.45150: Flags [.], ack 5,
> > win 512, options [nop,nop,TS val 953760623 ecr 953759375], length 0
> > 11:57:14.436271 IP 127.0.0.1.45150 > 127.0.0.1.4242: Flags [R], seq
> > 2541101298, win 0, length 0
> > 11:57:15.464613 IP 127.0.0.1.45150 > 127.0.0.1.4242: Flags [S], seq
> > 2560603644, win 65495, options [mss 65495,sackOK,TS val 953761652 ecr
> > 0,nop,wscale 7], length 0
> >
> > >  		skb_to_free = sk->sk_rx_skb_cache;
> > >  		sk->sk_rx_skb_cache = NULL;
> > >  		ret = tcp_v4_do_rcv(sk, skb);
> > >
> > > We applied this change to the kernel and confirmed that the latency
> > > spikes disappeared with the reproducer program.
> > >
> > >
> > > More Races
> > > ----------
> > >
> > > Further, the person who found the code path and wrote the fix found
> > > another race, resulting from commit ec94c2696f0b ("tcp/dccp: avoid one
> > > atomic operation for timewait hashdance"). He believes the
> > > 'refcount_set()' should be done before the 'spin_lock()', as the
> > > current order allows others to see the socket in the list but ignore
> > > it because the reference count is zero. This race seems much rarer
> > > than the above one, and thus we have no reproducible test for it yet.
> >
> > Again, TCP is best effort; seeing the refcount being 0 or not is
> > absolutely fine.
> >
> > The cpu reading the refcnt can always be faster than the cpu setting the
> > refcount to a non-zero value, no matter how hard you try.
> >
> > The rules are more like: we need to ensure all fields have stable/updated
> > values before allowing other cpus to get the object. Therefore, writing a
> > non-zero refcount should happen last.
>
> I personally agree with this as well.
>
>
> Thanks,
> SeongJae Park
>
> >
> > Thanks.
> [...]
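
[ For reference, a minimal sketch of the publication rule Eric describes
  above. 'struct foo' and foo_publish() are hypothetical names used for
  illustration only; this is not the actual timewait hashdance code: ]

#include <linux/refcount.h>
#include <linux/rculist_nulls.h>

struct foo {
	int			key;
	refcount_t		refcnt;
	struct hlist_nulls_node	node;
};

static void foo_publish(struct foo *f, struct hlist_nulls_head *list, int key)
{
	/* Initialize every field while the object is still private to
	 * this cpu.
	 */
	f->key = key;

	/* Write the non-zero refcount last among the private stores. A
	 * reader racing with us may still observe 0 and skip the object
	 * (that is fine, TCP is best effort), but it must never observe
	 * a non-zero refcnt alongside stale fields.
	 */
	refcount_set(&f->refcnt, 1);

	/* Only now link the object where other cpus can find it;
	 * hlist_nulls_add_head_rcu() publishes via rcu_assign_pointer(),
	 * so the stores above are ordered before the list insertion.
	 */
	hlist_nulls_add_head_rcu(&f->node, list);
}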