Re: Spurious ECONNRESET errors

Hi,

Well, that's a pretty weird problem and I have more questions than
answers, I'm afraid.

I think what might be happening with the ACK storm is that the ack
packet somehow gets put on the forwarding path at the receiving host
(e614slow) and spins there until the TTL count is exhausted, after which
the packet gets dropped by the kernel. You could confirm this by going
back to the tcpdump and checking the TTL values for all of the acks.
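
A rough sketch of how that TTL check could be scripted, assuming the dump was
saved as text. As far as I know, tcpdump without -v only prints a "[ttl N]"
marker when the TTL is very low, while -v prints "(ttl N, ..." for every
packet, so the pattern below accepts both. The function name and the default
`needle` argument are made up for illustration.

```python
import re
from collections import Counter

# Accepts "[ttl 1]" (non-verbose low-TTL marker) and "(ttl 64, ..." (-v output).
TTL_RE = re.compile(r"[\[(]ttl (\d+)")

def ttl_histogram(lines, needle="ack 2196280"):
    """Count TTL values over lines containing `needle`.

    The key None collects lines where tcpdump printed no TTL at all.
    """
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        match = TTL_RE.search(line)
        counts[int(match.group(1)) if match else None] += 1
    return counts

# Three lines taken from the trace quoted below.
sample = [
    "22:18:50.596385 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: . ack 2196280 win 10720 <nop,nop,timestamp 237445069 132778250> (DF)",
    "22:18:50.596521 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: . ack 2196280 win 10720 <nop,nop,timestamp 237445069 132778250> (DF) [ttl 1]",
    "22:18:50.714930 e614slow.triumf.ca.58166 > midtwist.Triumf.CA.54268: P 2196200:2196280(80) ack 248993 win 7504 <nop,nop,timestamp 132778271 237445065> (DF)",
]

if __name__ == "__main__":
    print(ttl_histogram(sample))
```

If the trace is retaken with -v, a TTL that steps down by one from ack to ack
would support the forwarding-loop idea; a TTL that stays constant would argue
against it.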

Failure to receive the ack explains the retransmission, but what I don't
understand is why midtwist simultaneously decides that the connection no
longer exists and responds with a reset. It shouldn't do this even if it
has received an ICMP time-exceeded error (which I don't see in the dump,
but I'm not sure whether you captured anything besides TCP).

It might shed some light on this if you can take a trace from both
machines and compare them. My guess is you took this trace on e614slow,
correct?
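
For the comparison, something along these lines might be enough. It is only a
sketch: it assumes both traces are text dumps in the same tcpdump format, that
both hosts print addresses the same way (running tcpdump with -n on both sides
should help), and that the capture clocks are not synchronized, so timestamps
are dropped before matching. The helper names and the sample lines are
hypothetical.

```python
from collections import Counter

def strip_timestamp(line):
    """Drop the leading capture timestamp, which differs between hosts."""
    parts = line.strip().split(None, 1)
    return parts[1] if len(parts) == 2 else ""

def only_in_first(trace_a, trace_b):
    """Packets that show up more often in trace_a than in trace_b."""
    a = Counter(p for p in map(strip_timestamp, trace_a) if p)
    b = Counter(p for p in map(strip_timestamp, trace_b) if p)
    return a - b  # Counter subtraction keeps only positive counts

# Made-up example: one side logs the same ack three times, the other once.
ack = "10.0.0.1.54268 > 10.0.0.2.58166: . ack 2196280 win 10720 (DF)"
side_a = ["22:18:50.538143 " + ack,
          "22:18:50.538358 " + ack,
          "22:18:50.539411 " + ack]
side_b = ["22:18:50.538001 " + ack]

if __name__ == "__main__":
    for packet, extra in only_in_first(side_a, side_b).items():
        print(extra, "extra:", packet)
```

Running it in both directions would show whether the duplicate acks are on the
wire at all or exist only on one host.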

As to why the ack would get assigned to the forwarding path, I have no
idea. Something is borked, no doubt. Do you have forwarding enabled? Are
you using iptables?
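
In case it saves a step, here is a small sketch for the forwarding question.
It assumes a Linux host where the global IPv4 forwarding switch is exposed at
/proc/sys/net/ipv4/ip_forward; the function name is made up, and it says
nothing about per-interface settings or iptables rules.

```python
from pathlib import Path

def forwarding_enabled(path="/proc/sys/net/ipv4/ip_forward"):
    """Return True or False from the sysctl file, or None if it can't be read."""
    try:
        return int(Path(path).read_text().strip()) != 0
    except (OSError, ValueError):
        return None

if __name__ == "__main__":
    print("IPv4 forwarding enabled:", forwarding_enabled())
```

For the iptables part, the output of "iptables -L -n" run as root on both
machines would show whether any rules are loaded.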

Cheers,

	MikaL

On Sat, 2002-10-19 at 00:41, Konstantin Olchanski wrote:
> 
> Hi, linux-net! My TCP sockets are being killed by spurious ECONNRESET errors.
> I seek help in understanding what is going on and (hopefully) in fixing it.
> 
> This is what I see: between two Linux machines we have 6-10 open TCP
> sockets carrying very light RPC-type traffic. Every ~6 hours one of these
> sockets would spontaneously break with recv() returning errno==ECONNRESET on
> both sides of the connection. The other sockets would stay alive.
> 
> I ran tcpdump overnight and captured 3 "broken socket" events, all
> following the same pattern: healthy traffic, then a flood of ACK packets,
> then a connection reset:
> 
> a) healthy traffic:
> 
> 22:18:50.506527 e614slow.triumf.ca.58166 > midtwist.Triumf.CA.54268: P 2196040:2196176(136) ack 248977 win 7504 <nop,nop,timestamp 132778250 237445056> (DF)
> 22:18:50.506705 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: P 248977:248985(8) ack 2196176 win 10720 <nop,nop,timestamp 237445065 132778250> (DF)
> 22:18:50.506933 e614slow.triumf.ca.58166 > midtwist.Triumf.CA.54268: . ack 248985 win 7504 <nop,nop,timestamp 132778250 237445065> (DF)
> 22:18:50.506991 e614slow.triumf.ca.58166 > midtwist.Triumf.CA.54268: P 2196176:2196200(24) ack 248985 win 7504 <nop,nop,timestamp 132778250 237445065> (DF)
> 22:18:50.507335 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: P 248985:248993(8) ack 2196200 win 10720 <nop,nop,timestamp 237445065 132778250> (DF)
> 
> b) a flood of ACK packets:
> 
> 22:18:50.507479 e614slow.triumf.ca.58166 > midtwist.Triumf.CA.54268: P 2196200:2196280(80) ack 248993 win 7504 <nop,nop,timestamp 132778250 237445065> (DF)
> 22:18:50.538143 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: . ack 2196280 win 10720 <nop,nop,timestamp 237445069 132778250> (DF)
> 22:18:50.538358 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: . ack 2196280 win 10720 <nop,nop,timestamp 237445069 132778250> (DF)
> 22:18:50.539411 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: . ack 2196280 win 10720 <nop,nop,timestamp 237445069 132778250> (DF)
> ... [deleted about 100 ACK packets exactly like the above]
> 
> c) a connection reset:
> 
> 22:18:50.595944 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: . ack 2196280 win 10720 <nop,nop,timestamp 237445069 132778250> (DF)
> 22:18:50.596385 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: . ack 2196280 win 10720 <nop,nop,timestamp 237445069 132778250> (DF)
> 22:18:50.596521 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: . ack 2196280 win 10720 <nop,nop,timestamp 237445069 132778250> (DF) [ttl 1]
> 22:18:50.597640 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: . ack 2196280 win 10720 <nop,nop,timestamp 237445069 132778250> (DF) [ttl 1]
> 22:18:50.714930 e614slow.triumf.ca.58166 > midtwist.Triumf.CA.54268: P 2196200:2196280(80) ack 248993 win 7504 <nop,nop,timestamp 132778271 237445065> (DF)
> 22:18:50.715052 midtwist.Triumf.CA.54268 > e614slow.triumf.ca.58166: R 2688217848:2688217848(0) win 0 (DF)
> 
> d) on both sides recv() returns errno==ECONNRESET
> 
> How do I read this? Somehow e614slow is not seeing the ACK from
> midtwist and retransmits the last packet? But why a flood of ACKs? (3 ACKs
> per millisecond!). Why is the connection reset when e614slow retransmits
> the last packet?
> 
> A few more details: Both machines are running Red Hat Linux 7.2 with
> the Red Hat 2.4.18-10 kernels. Midtwist is a 2x1GHz P-III, e614slow is
> a single 400 MHz P-III. The machines are directly connected by 100Mbit
> ethernet to a CenterCom FS716 switch. There is other traffic between
> them (X11, NFS, rsync, ssh), but not very heavy and we see no correlations
> with broken sockets. Midtwist is heavily loaded; on its other
> network interface we continuously read 8 MBytes/second using similar
> TCP/RPC code (but see no broken sockets there, maybe because the data
> source is VxWorks 5.3/5.4 rather than Linux).
> 
> Any insight into this problem will be highly appreciated.
> 
> You are welcome to learn more about the TWIST experiment at TRIUMF
> at http://twist.triumf.ca, and welcome to learn more about the MIDAS data
> acquisition system at http://midas.psi.ch.
> 
> -- 
> Konstantin Olchanski
> Email: olchansk@triumf.ca
> Snail mail: 4004 Wesbrook Mall, TRIUMF, Vancouver, B.C., V6T 2A3, Canada
> -
> To unsubscribe from this list: send the line "unsubscribe linux-net" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
