Mysterious 3 and 9 second delays calling connect()

Mark Doliner <mark.doliner@xxxxxxxxxxxxx> · Thu, 21 Oct 2010 13:06:39 -0700

Hi!  When making a TCP connection from a client to a server, connect()
is usually fast, but I occasionally see a 3 or 9 second delay.  I've
found a few other mentions of this, with no obvious resolution:
http://kerneltrap.org/mailarchive/linux-net/2008/2/4/713224/thread
("TCP connect hangs for 3 seconds")
http://kerneltrap.org/mailarchive/linux-net/2008/3/26/1264734/thread
("question about 3sec timeouts with tcp")
http://kerneltrap.org/mailarchive/linux-net/2008/3/31/1306874/thread
(continuation of the above)
http://kerneltrap.org/mailarchive/linux-kernel/2008/3/15/1170654/thread
(this one seems like it might be a different issue, "[2.6.24.3][net]
bug: TCP 3rd handshake abnormal timeouts")
http://kerneltrap.org/mailarchive/linux-netdev/2010/4/12/6274500
("[RFC] random SYN drops causing connect() delays")
http://www.linuxquestions.org/questions/linux-networking-3/3000ms-delay-on-tcp-connections-613670/
("")

So I thought I'd share some info (disclaimer: I'm not a kernel dev and
this info could be wrong--if so, please correct me!).  First off, the
3 and 9 second delays come from the client.  If a client tries to
establish a TCP connection to a server but the server does not
respond, then the client tries again after 3 seconds, then once again
after 6 more seconds.  The number of retry attempts is configurable by
changing tcp_syn_retries ("sysctl net.ipv4.tcp_syn_retries," "man 7
tcp" for description).  I don't believe the backoff delay is
configuration, but I could be wrong.

Why would a server drop a SYN?  I'm aware of two scenarios.  There may
be others.
1. Good ol' fashioned packet loss.  The SYN never makes it to the
server.  Could happen due to a bad physical network connection,
congestion on a switch or router, or maybe a heavily loaded server.
There's nothing mysterious here, and not much else to say about it.
2. Overflowing the listen() command's incoming connection backlog.

The rest of this email deals with situation #2.

Steps to Reproduce:
I think I've whittled down the problem to a minimal test server and
client that reproduces the issue 100% of the time (see source below).
1. "sysctl -w net.ipv4.tcp_abort_on_overflow=0" ("man 7 tcp" for description)
2. Server calls listen() on a socket with a backlog of 0
3. Client makes synchronous connections to server, calling connect()
then immediately calling close().  The first three connections are
instant.  The fourth will be delayed by 3 or 9 seconds.

What does the network traffic look like?
For the first three connections the client sends SYN, server
immediately replies with SYN ACK, then clients sends ACK then FIN ACK.
 The server does NOT respond with FIN ACK.  I've been assuming that
this behavior is expected.  Perhaps FIN ACK will not be sent until the
server process has called accept()?  So that makes sense in my mind.

Here's the part I don't understand: For the fourth connection the
client sends SYN, but the server does not respond (neither with SYN
ACK nor RST).  The client resends the SYN after 3 seconds.  Sometimes
the server will respond with SYN ACK and things continue normally.
Other times the server will ignore this second SYN, the client waits 6
more seconds and resends the SYN, to which the server will always
reply with SYN ACK.

Notes:
- This weird behavior only happens if tcp_abort_on_overflow is 0.  If
it's set to 1 then incoming connections are immediately rejected when
the incoming backlog is full, as expected.
- The weird behavior only happens when the server's listen backlog is full.
- The weird behavior doesn't happen if the server calls accept() fast
enough, as expected.
- The weird behavior doesn't happen if the server's listen backlog is
sufficiently large, as expected.  Setting a larger listen backlog
allows more connections before the weird behavior manifests itself.
Connections 1 through backlog+3 are always instant.  Connection
backlog+4 will be delayed by 3 or 9 seconds.
- Once the backlog is full, the ListenOverflows and ListenDrops SNMP
MIB items increase (/proc/net/netstat or netstat -s | grep -i
'listen').  They continue to increase over time until the server
accepts connections, even if there are no new incoming connections.
This is weird to me.
- Enabling SO_LINGER on the client side will makes the issue disappear
(close() will pause instead of connect())
- Happens over loopback or real networks (so the behavior comes from
the TCP stack?)

Closing Thoughts:
Clearly servers should set a large backlog so that this doesn't
happen.  If the backlog is large and the server still can't accept
connections fast enough, then add more servers.  It's a bad idea to
expect the kernel to gracefully handle the situation where the listen
backlog is full.  There is a max backlog setting that may need to be
increased: /sbin/sysctl -w net.core.somaxconn=1024

In a server environment, monitoring ListenOverflows from
/proc/net/netstat ("netstat -s | grep -i 'listen' ") is smart.

I'm assuming that historically "tcp_abort_on_overflow" defaulted to 1,
and this changed to 0 because having the server ignore instead of drop
incoming connections causes the client's TCP stack to auto-retry, and
it was thought that this would be convenient.  i.e. a 3 second delay
is better than an immediate failure.  It also makes the server
behavior consistent between the case where the process's listen
backlog is full and the case where the server never receives the SYN.
Whether tcp_abort_on_overflow is good is debatable, which I suppose is
why it's configurable.  I'm a fan of "fail fast."  3 second delays are
more likely to go unnoticed, whereas failed connections might be more
obvious and thus more likely to be fixed (by increasing the listen
backlog or adding more servers).

Once the backlog is full, why does the server SYN ACK some incoming
connections but not others??  It seems like it allows a certain number
of connections per second, but I haven't been able to find this logic
in the kernel source.

My apologies for the long email.  My hope is that this information
will be useful to someone in the future who may be searching for info
on why this happens.
--Mark

Example server code:
#include <arpa/inet.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

static double get_time()
{
	struct timeval t;
	gettimeofday(&t, NULL);
	return t.tv_sec + (t.tv_usec / 1000000.0);
}

int main(int argc, char *argv[])
{
	unsigned short port;
	char *endptr;
	int listen_fd;
	int flag;
	struct sockaddr_in sockaddr;

	/* Read parameters */
	if (argc != 2) {
		fprintf(stderr, "usage: %s <port>\n", argv[0]);
		exit(1);
	}
	port = strtol(argv[1], &endptr, 0);
	if (*endptr) {
		fprintf(stderr, "Invalid port number\n");
		exit(2);
	}

	/* Create socket */
	listen_fd = socket(AF_INET, SOCK_STREAM, 0);
	if (listen_fd < 0) {
		fprintf(stderr, "Error calling socket(): %s\n", strerror(errno));
		exit(3);
	}

	/*
	 * Allow this port to be listened on again after we exit without
	 * having to wait for TIME_WAIT to expire.  This doesn't affect the
	 * test... it just makes it more convenient to run repeatedly.
	 */
	flag = 1;
	if (setsockopt(listen_fd, SOL_SOCKET, SO_REUSEADDR, &flag, sizeof(flag)) != 0)
		fprintf(stderr, "Error: Could not set SO_REUSEADDR: %s\n", strerror(errno));

	/* Bind socket */
	memset(&sockaddr, 0, sizeof(sockaddr));
	sockaddr.sin_family = AF_INET;
	sockaddr.sin_addr.s_addr = htonl(INADDR_ANY);
	sockaddr.sin_port = htons(port);
	if (bind(listen_fd, (struct sockaddr *)&sockaddr, sizeof(sockaddr)) < 0) {
		fprintf(stderr, "Error calling bind(): %s\n", strerror(errno));
		exit(4);
	}

	/* Listen */
	if (listen(listen_fd, 0) < 0) {
		fprintf(stderr, "Error calling listen()\n");
		exit(5);
	}

#ifndef ACTUALLY_ACCEPT_CONNECTIONS_INSTEAD_OF_JUST_SLEEPING
	/* Do nothing--let incoming connections pile up */
	printf("%0.3f  Listening but not calling accept()\n", get_time());
	sleep(1000);
#else
	/* Consume connections */
	while (1) {
		printf("%0.3f  Calling accept()\n", get_time());
		int client_fd = accept(listen_fd, NULL, NULL);
		if (client_fd < 0) {
			fprintf(stderr, "Error calling accept()\n");
			exit(6);
		}

		/*
		 * This 10 microsecond sleep ensures that the client (usually)
		 * closes their side of the connection before we do.  It makes
		 * the 3 and 9 second timeouts much more common.
		 */
		usleep(10);

		printf("%0.3f  Calling close()\n", get_time());
		if (close(client_fd) != 0) {
			fprintf(stderr, "Error calling close(): %s\n", strerror(errno));
			exit(7);
		}
	}
#endif /* ACTUALLY_ACCEPT_CONNECTIONS_INSTEAD_OF_JUST_SLEEPING */

	return 0;
}

Example client code:
#include <arpa/inet.h>
#include <ctype.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>

static double get_time()
{
	struct timeval t;
	gettimeofday(&t, NULL);
	return t.tv_sec + (t.tv_usec / 1000000.0);
}

int main(int argc, char *argv[])
{
	const char *server;
	unsigned short port;
	char *endptr;
	struct sockaddr_in sockaddr;
	unsigned int count;

	/* Read parameters */
	if (argc != 3) {
		fprintf(stderr, "usage: %s <ip> <port>\n", argv[0]);
		exit(1);
	}
	server = argv[1];
	port = strtol(argv[2], &endptr, 0);
	if (*endptr) {
		fprintf(stderr, "Invalid port number\n");
		exit(2);
	}

	/* Bind socket */
	memset(&sockaddr, 0, sizeof(sockaddr));
	sockaddr.sin_family = AF_INET;
	sockaddr.sin_addr.s_addr = inet_addr(server);
	sockaddr.sin_port = htons(port);

	for (count = 1; count <= 100; count++) {
		double time_begin;
		double time_end;
		int fd;

		/* Create socket */
		printf("%0.3f  Calling socket()\n", get_time());
		fd = socket(PF_INET, SOCK_STREAM, IPPROTO_IP);
		if (fd < 0) {
			fprintf(stderr, "Error calling socket(): %s\n", strerror(errno));
			exit(3);
		}

		/* Record begin time */
		time_begin = get_time();

#if 0
		/*
		 * Set SO_LINGER on the client socket.  This causes close() to
		 * block until all TCP traffic for the socket has been flushed
		 * and client and server have both acknowledged that the socket
		 * is closed.
		 *
		 * Enabling this code eliminates the 3 and 9 second delays from
		 * connect() and instead causes close() to hang until the server
		 * accepts connections.
		 */
		struct linger linger;
		linger.l_onoff = 1;
		linger.l_linger = 100000; /* Wait effectively forever */
		if (setsockopt(fd, SOL_SOCKET, SO_LINGER, &linger, sizeof(linger)) != 0) {
			fprintf(stderr, "Error calling socket(): %s\n", strerror(errno));
			exit(3);
		}
#endif

		/* Connect */
		printf("%0.3f  Calling connect()\n", get_time());
		if (connect(fd, (struct sockaddr *)&sockaddr, sizeof(sockaddr)) < 0) {
			fprintf(stderr, "Error connecting to %s: %s\n", server, strerror(errno));
			continue;
		}

		/* Record end time */
		time_end = get_time();
		if ((time_end - time_begin) > 1) {
			printf("Connection #%u took %.6lf seconds\n", count, time_end - time_begin);
		}

		/* Close socket */
		printf("%0.3f  Calling close()\n", get_time());
		if (close(fd) != 0) {
			fprintf(stderr, "Error calling close(): %s\n", strerror(errno));
			exit(4);
		}
	}

	return 0;
}
--
To unsubscribe from this list: send the line "unsubscribe linux-net" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html