Spurious ECONNRESET errors

Hi, linux-net! My TCP sockets are being killed by spurious ECONNRESET errors.
I seek help in understanding what is going on and (hopefully) in fixing it.

This is what I see: between two Linux machines we have 6-10 open TCP
sockets carrying very light RPC-type traffic. About every 6 hours one of
these sockets spontaneously breaks, with recv() failing with
errno==ECONNRESET on both sides of the connection. The other sockets stay
alive.
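
To line up a broken socket with the packet capture, it helps to record the wall-clock time and the socket endpoints at the moment recv() fails. Below is a minimal sketch in Python of that idea; it is not the RPC code in question, and the helper name recv_logged and its log parameter are made up for illustration.

```python
import errno
import socket
import time


def recv_logged(sock, nbytes, log=print):
    """Call sock.recv() and, on ECONNRESET, log a timestamp and endpoints.

    The exception is always re-raised, so the caller's error handling
    is unchanged. `log` is any callable taking one string.
    """
    try:
        return sock.recv(nbytes)
    except OSError as exc:
        if exc.errno == errno.ECONNRESET:
            try:
                local, peer = sock.getsockname(), sock.getpeername()
            except OSError:
                # After a reset the kernel may no longer report the peer.
                local = peer = None
            log(f"{time.time():.6f} ECONNRESET local={local} peer={peer}")
        raise
```

The logged timestamp can then be matched against the tcpdump timestamps, and the port numbers identify which of the open connections was reset.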

I ran tcpdump overnight and captured 3 "broken socket" events, all
following the same pattern: healthy traffic, then a flood of ACK packets,
then a connection reset:

a) healthy traffic:

22:18:50.506527 e614slow.triumf.ca.58166 > midtwist.Triumf.CA.54268: P 2196040:2196176(136) ack 248977 win 7504 <nop,nop,timestamp 132778250 237445056> (DF)
22:18:50.506705 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: P 248977:248985(8) ack 2196176 win 10720 <nop,nop,timestamp 237445065 132778250> (DF)
22:18:50.506933 e614slow.triumf.ca.58166 > midtwist.Triumf.CA.54268: . ack 248985 win 7504 <nop,nop,timestamp 132778250 237445065> (DF)
22:18:50.506991 e614slow.triumf.ca.58166 > midtwist.Triumf.CA.54268: P 2196176:2196200(24) ack 248985 win 7504 <nop,nop,timestamp 132778250 237445065> (DF)
22:18:50.507335 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: P 248985:248993(8) ack 2196200 win 10720 <nop,nop,timestamp 237445065 132778250> (DF)

b) a flood of ACK packets:

22:18:50.507479 e614slow.triumf.ca.58166 > midtwist.Triumf.CA.54268: P 2196200:2196280(80) ack 248993 win 7504 <nop,nop,timestamp 132778250 237445065> (DF)
22:18:50.538143 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: . ack 2196280 win 10720 <nop,nop,timestamp 237445069 132778250> (DF)
22:18:50.538358 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: . ack 2196280 win 10720 <nop,nop,timestamp 237445069 132778250> (DF)
22:18:50.539411 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: . ack 2196280 win 10720 <nop,nop,timestamp 237445069 132778250> (DF)
... [deleted about 100 ACK packets exactly like the above]

c) a connection reset:

22:18:50.595944 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: . ack 2196280 win 10720 <nop,nop,timestamp 237445069 132778250> (DF)
22:18:50.596385 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: . ack 2196280 win 10720 <nop,nop,timestamp 237445069 132778250> (DF)
22:18:50.596521 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: . ack 2196280 win 10720 <nop,nop,timestamp 237445069 132778250> (DF) [ttl 1]
22:18:50.597640 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: . ack 2196280 win 10720 <nop,nop,timestamp 237445069 132778250> (DF) [ttl 1]
22:18:50.714930 e614slow.triumf.ca.58166 > midtwist.Triumf.CA.54268: P 2196200:2196280(80) ack 248993 win 7504 <nop,nop,timestamp 132778271 237445065> (DF)
22:18:50.715052 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: R 2688217848:2688217848(0) win 0 (DF)

d) on both sides recv() returns errno==ECONNRESET

How do I read this? Somehow e614slow is not seeing the ACK from
midtwist and retransmits the last packet? But why a flood of ACKs? (3 ACKs
per millisecond!). Why is the connection reset when e614slow retransmits
the last packet?
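
The three things asked about here (repeated identical ACKs, a retransmitted segment, and the final reset) can be counted mechanically from the capture text. The following is a rough sketch in Python, assuming the tcpdump text format shown in this message; the names PKT and summarize are hypothetical and not part of any existing tool.

```python
import re
from collections import Counter

# Matches lines such as:
#   22:18:50.538143 hostA.port > hostB.port: . ack 2196280 win 10720 ...
#   22:18:50.507479 hostA.port > hostB.port: P 2196200:2196280(80) ack 248993 ...
PKT = re.compile(
    r"^(?P<h>\d+):(?P<m>\d+):(?P<s>\d+\.\d+) "
    r"(?P<src>\S+) > (?P<dst>\S+): (?P<flags>\S+) "
    r"(?:(?P<seq>\d+:\d+)\(\d+\) )?(?:ack (?P<ack>\d+) )?"
)


def summarize(lines):
    """Count repeated pure ACKs, repeated data segments, and resets."""
    dup_acks = Counter()   # (src, ack number) -> payload-free ACKs seen
    first_last = {}        # (src, ack number) -> (first time, last time)
    segments = Counter()   # (src, seq range) -> times the segment was sent
    resets = []            # (time, src) for every packet with the R flag
    for line in lines:
        m = PKT.match(line.strip())
        if not m:
            continue
        # Seconds since midnight; a capture that crosses midnight is not handled.
        t = int(m["h"]) * 3600 + int(m["m"]) * 60 + float(m["s"])
        if "R" in m["flags"]:
            resets.append((t, m["src"]))
        elif m["seq"]:
            segments[(m["src"], m["seq"])] += 1
        elif m["ack"]:
            key = (m["src"], m["ack"])
            dup_acks[key] += 1
            start, _ = first_last.get(key, (t, t))
            first_last[key] = (start, t)
    ack_runs = []
    for (src, ack), count in dup_acks.items():
        if count > 1:
            start, end = first_last[(src, ack)]
            ack_runs.append((src, ack, count, end - start))
    retransmits = [(src, seq, n) for (src, seq), n in segments.items() if n > 1]
    return {"ack_runs": ack_runs, "retransmits": retransmits, "resets": resets}
```

Each entry in ack_runs gives the sender, the ACK number, how many times it was repeated, and the time span in seconds, so the ACK rate is the count divided by the span. Run on the full capture, not the abbreviated listing, this would give the actual rate of the flood.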

A few more details: Both machines are running Red Hat Linux 7.2 with
the Red Hat 2.4.18-10 kernel. Midtwist is a 2x1GHz P-III, e614slow is
a single 400 MHz P-III. The machines are directly connected by 100Mbit
ethernet to a CenterCom FS716 switch. There is other traffic between
them (X11, NFS, rsync, ssh), but it is not very heavy and we see no
correlation with the broken sockets. Midtwist is heavily loaded: on its
other network interface we continuously read 8 MBytes/second using similar
TCP/RPC code (but see no broken sockets there, maybe because the data
source is VxWorks 5.3/5.4 rather than Linux).

Any insight into this problem will be highly appreciated.

You are welcome to learn more about the TWIST experiment at TRIUMF
at http://twist.triumf.ca, and welcome to learn more about the MIDAS data
acquisition system at http://midas.psi.ch.

-- 
Konstantin Olchanski
Email: olchansk@triumf.ca
Snail mail: 4004 Wesbrook Mall, TRIUMF, Vancouver, B.C., V6T 2A3, Canada