Hi, linux-net!

My TCP sockets are being killed by spurious ECONNRESET errors. I seek help in understanding what is going on and (hopefully) in fixing it.

This is what I see: between two Linux machines we have 6-10 open TCP sockets carrying very light RPC-type traffic. Every ~6 hours one of these sockets spontaneously breaks, with recv() returning errno==ECONNRESET on both sides of the connection. The other sockets stay alive.

I ran tcpdump overnight and captured 3 "broken socket" events, all following the same pattern: healthy traffic, then a flood of ACK packets, then a connection reset:

a) healthy traffic:

22:18:50.506527 e614slow.triumf.ca.58166 > midtwist.Triumf.CA.54268: P 2196040:2196176(136) ack 248977 win 7504 <nop,nop,timestamp 132778250 237445056> (DF)
22:18:50.506705 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: P 248977:248985(8) ack 2196176 win 10720 <nop,nop,timestamp 237445065 132778250> (DF)
22:18:50.506933 e614slow.triumf.ca.58166 > midtwist.Triumf.CA.54268: . ack 248985 win 7504 <nop,nop,timestamp 132778250 237445065> (DF)
22:18:50.506991 e614slow.triumf.ca.58166 > midtwist.Triumf.CA.54268: P 2196176:2196200(24) ack 248985 win 7504 <nop,nop,timestamp 132778250 237445065> (DF)
22:18:50.507335 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: P 248985:248993(8) ack 2196200 win 10720 <nop,nop,timestamp 237445065 132778250> (DF)

b) a flood of ACK packets:

22:18:50.507479 e614slow.triumf.ca.58166 > midtwist.Triumf.CA.54268: P 2196200:2196280(80) ack 248993 win 7504 <nop,nop,timestamp 132778250 237445065> (DF)
22:18:50.538143 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: . ack 2196280 win 10720 <nop,nop,timestamp 237445069 132778250> (DF)
22:18:50.538358 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: . ack 2196280 win 10720 <nop,nop,timestamp 237445069 132778250> (DF)
22:18:50.539411 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: . ack 2196280 win 10720 <nop,nop,timestamp 237445069 132778250> (DF)
...
[deleted about 100 ACK packets exactly like the above]

c) a connection reset:

22:18:50.595944 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: . ack 2196280 win 10720 <nop,nop,timestamp 237445069 132778250> (DF)
22:18:50.596385 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: . ack 2196280 win 10720 <nop,nop,timestamp 237445069 132778250> (DF)
22:18:50.596521 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: . ack 2196280 win 10720 <nop,nop,timestamp 237445069 132778250> (DF) [ttl 1]
22:18:50.597640 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: . ack 2196280 win 10720 <nop,nop,timestamp 237445069 132778250> (DF) [ttl 1]
22:18:50.714930 e614slow.triumf.ca.58166 > midtwist.Triumf.CA.54268: P 2196200:2196280(80) ack 248993 win 7504 <nop,nop,timestamp 132778271 237445065> (DF)
22:18:50.715052 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: R 2688217848:2688217848(0) win 0 (DF)

d) on both sides recv() returns errno==ECONNRESET

How do I read this? Somehow e614slow is not seeing the ACK from midtwist and retransmits the last packet? But why a flood of ACKs (3 ACKs per millisecond!)? And why is the connection reset when e614slow retransmits the last packet?

A few more details:

Both machines are running Red Hat Linux 7.2 with the Red Hat 2.4.18-10 kernels. Midtwist is a 2x1GHz P-III, e614slow is a single 400 MHz P-III. The machines are directly connected by 100 Mbit Ethernet to a CenterCom FS716 switch. There is other traffic between them (X11, NFS, rsync, ssh), but it is not very heavy and we see no correlation with the broken sockets. Midtwist is heavily loaded: on its other network interface we continuously read 8 MBytes/second using similar TCP/RPC code (but see no broken sockets there, maybe because the data source is VxWorks 5.3/5.4 rather than Linux).

Any insight into this problem will be highly appreciated.
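
For anyone who wants to look for the same pattern in their own captures, here is a rough sketch of how the "flood of identical ACKs" could be picked out of tcpdump text output automatically. It is only an illustration and not part of the original capture procedure: the regular expression is an assumption fitted to the old tcpdump text format quoted above, and the names LINE_RE and find_ack_runs are hypothetical.

```python
import re

# Assumed layout of one tcpdump text line (old format, as quoted above):
#   time src > dst: flags [first:last(len)] [ack N] win W <options> ...
LINE_RE = re.compile(
    r"^(?P<time>\d\d:\d\d:\d\d\.\d+)\s+"
    r"(?P<src>\S+)\s+>\s+(?P<dst>\S+):\s+"
    r"(?P<flags>\S+)\s+"
    r"(?:(?P<seq>\d+:\d+\(\d+\))\s+)?"
    r"(?:ack\s+(?P<ack>\d+)\s+)?"
    r"win\s+(?P<win>\d+)"
)


def find_ack_runs(lines, min_run=3):
    """Find runs of consecutive identical data-less ACK packets.

    Returns a list of ((src, dst, ack, win), count, first_time, last_time)
    tuples, one per run that is at least min_run packets long.
    """
    runs = []
    current = None  # [key, count, first_time, last_time] of the open run
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # lines not in the assumed format are ignored
        # A "pure" ACK has flag ".", carries no data range, and has an ack number.
        pure_ack = m["flags"] == "." and m["seq"] is None and m["ack"] is not None
        key = (m["src"], m["dst"], m["ack"], m["win"]) if pure_ack else None
        if current and key is not None and key == current[0]:
            current[1] += 1          # same ACK repeated: extend the run
            current[3] = m["time"]
            continue
        if current and current[1] >= min_run:
            runs.append(tuple(current))  # close a run that was long enough
        current = [key, 1, m["time"], m["time"]] if key is not None else None
    if current and current[1] >= min_run:
        runs.append(tuple(current))
    return runs
```

Feeding it the lines of a saved capture (for example, the lines of a text file produced by tcpdump) would give one entry per burst, with the repeat count and the first and last timestamps, which makes it easier to compare several "broken socket" events side by side.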

You are welcome to learn more about the TWIST experiment at TRIUMF at http://twist.triumf.ca, and about the MIDAS data acquisition system at http://midas.psi.ch.

-- 
Konstantin Olchanski
Email: olchansk@triumf.ca
Snail mail: 4004 Wesbrook Mall, TRIUMF, Vancouver, B.C., V6T 2A3, Canada