Hi! When making a TCP connection from a client to a server, connect() is usually fast, but I occasionally see a 3 or 9 second delay. I've found a few other mentions of this, with no obvious resolution: http://kerneltrap.org/mailarchive/linux-net/2008/2/4/713224/thread ("TCP connect hangs for 3 seconds") http://kerneltrap.org/mailarchive/linux-net/2008/3/26/1264734/thread ("question about 3sec timeouts with tcp") http://kerneltrap.org/mailarchive/linux-net/2008/3/31/1306874/thread (continuation of the above) http://kerneltrap.org/mailarchive/linux-kernel/2008/3/15/1170654/thread (this one seems like it might be a different issue, "[2.6.24.3][net] bug: TCP 3rd handshake abnormal timeouts") http://kerneltrap.org/mailarchive/linux-netdev/2010/4/12/6274500 ("[RFC] random SYN drops causing connect() delays") http://www.linuxquestions.org/questions/linux-networking-3/3000ms-delay-on-tcp-connections-613670/ ("") So I thought I'd share some info (disclaimer: I'm not a kernel dev and this info could be wrong--if so, please correct me!). First off, the 3 and 9 second delays come from the client. If a client tries to establish a TCP connection to a server but the server does not respond, then the client tries again after 3 seconds, then once again after 6 more seconds. The number of retry attempts is configurable by changing tcp_syn_retries ("sysctl net.ipv4.tcp_syn_retries," "man 7 tcp" for description). I don't believe the backoff delay is configuration, but I could be wrong. Why would a server drop a SYN? I'm aware of two scenarios. There may be others. 1. Good ol' fashioned packet loss. The SYN never makes it to the server. Could happen due to a bad physical network connection, congestion on a switch or router, or maybe a heavily loaded server. There's nothing mysterious here, and not much else to say about it. 2. Overflowing the listen() command's incoming connection backlog. The rest of this email deals with situation #2. Steps to Reproduce: I think I've whittled down the problem to a minimal test server and client that reproduces the issue 100% of the time (see source below). 1. "sysctl -w net.ipv4.tcp_abort_on_overflow=0" ("man 7 tcp" for description) 2. Server calls listen() on a socket with a backlog of 0 3. Client makes synchronous connections to server, calling connect() then immediately calling close(). The first three connections are instant. The fourth will be delayed by 3 or 9 seconds. What does the network traffic look like? For the first three connections the client sends SYN, server immediately replies with SYN ACK, then clients sends ACK then FIN ACK. The server does NOT respond with FIN ACK. I've been assuming that this behavior is expected. Perhaps FIN ACK will not be sent until the server process has called accept()? So that makes sense in my mind. Here's the part I don't understand: For the fourth connection the client sends SYN, but the server does not respond (neither with SYN ACK nor RST). The client resends the SYN after 3 seconds. Sometimes the server will respond with SYN ACK and things continue normally. Other times the server will ignore this second SYN, the client waits 6 more seconds and resends the SYN, to which the server will always reply with SYN ACK. Notes: - This weird behavior only happens if tcp_abort_on_overflow is 0. If it's set to 1 then incoming connections are immediately rejected when the incoming backlog is full, as expected. - The weird behavior only happens when the server's listen backlog is full. - The weird behavior doesn't happen if the server calls accept() fast enough, as expected. - The weird behavior doesn't happen if the server's listen backlog is sufficiently large, as expected. Setting a larger listen backlog allows more connections before the weird behavior manifests itself. Connections 1 through backlog+3 are always instant. Connection backlog+4 will be delayed by 3 or 9 seconds. - Once the backlog is full, the ListenOverflows and ListenDrops SNMP MIB items increase (/proc/net/netstat or netstat -s | grep -i 'listen'). They continue to increase over time until the server accepts connections, even if there are no new incoming connections. This is weird to me. - Enabling SO_LINGER on the client side will makes the issue disappear (close() will pause instead of connect()) - Happens over loopback or real networks (so the behavior comes from the TCP stack?) Closing Thoughts: Clearly servers should set a large backlog so that this doesn't happen. If the backlog is large and the server still can't accept connections fast enough, then add more servers. It's a bad idea to expect the kernel to gracefully handle the situation where the listen backlog is full. There is a max backlog setting that may need to be increased: /sbin/sysctl -w net.core.somaxconn=1024 In a server environment, monitoring ListenOverflows from /proc/net/netstat ("netstat -s | grep -i 'listen' ") is smart. I'm assuming that historically "tcp_abort_on_overflow" defaulted to 1, and this changed to 0 because having the server ignore instead of drop incoming connections causes the client's TCP stack to auto-retry, and it was thought that this would be convenient. i.e. a 3 second delay is better than an immediate failure. It also makes the server behavior consistent between the case where the process's listen backlog is full and the case where the server never receives the SYN. Whether tcp_abort_on_overflow is good is debatable, which I suppose is why it's configurable. I'm a fan of "fail fast." 3 second delays are more likely to go unnoticed, whereas failed connections might be more obvious and thus more likely to be fixed (by increasing the listen backlog or adding more servers). Once the backlog is full, why does the server SYN ACK some incoming connections but not others?? It seems like it allows a certain number of connections per second, but I haven't been able to find this logic in the kernel source. My apologies for the long email. My hope is that this information will be useful to someone in the future who may be searching for info on why this happens. --Mark Example server code: #include <arpa/inet.h> #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/socket.h> #include <sys/time.h> #include <sys/types.h> #include <unistd.h> static double get_time() { struct timeval t; gettimeofday(&t, NULL); return t.tv_sec + (t.tv_usec / 1000000.0); } int main(int argc, char *argv[]) { unsigned short port; char *endptr; int listen_fd; int flag; struct sockaddr_in sockaddr; /* Read parameters */ if (argc != 2) { fprintf(stderr, "usage: %s <port>\n", argv[0]); exit(1); } port = strtol(argv[1], &endptr, 0); if (*endptr) { fprintf(stderr, "Invalid port number\n"); exit(2); } /* Create socket */ listen_fd = socket(AF_INET, SOCK_STREAM, 0); if (listen_fd < 0) { fprintf(stderr, "Error calling socket(): %s\n", strerror(errno)); exit(3); } /* * Allow this port to be listened on again after we exit without * having to wait for TIME_WAIT to expire. This doesn't affect the * test... it just makes it more convenient to run repeatedly. */ flag = 1; if (setsockopt(listen_fd, SOL_SOCKET, SO_REUSEADDR, &flag, sizeof(flag)) != 0) fprintf(stderr, "Error: Could not set SO_REUSEADDR: %s\n", strerror(errno)); /* Bind socket */ memset(&sockaddr, 0, sizeof(sockaddr)); sockaddr.sin_family = AF_INET; sockaddr.sin_addr.s_addr = htonl(INADDR_ANY); sockaddr.sin_port = htons(port); if (bind(listen_fd, (struct sockaddr *)&sockaddr, sizeof(sockaddr)) < 0) { fprintf(stderr, "Error calling bind(): %s\n", strerror(errno)); exit(4); } /* Listen */ if (listen(listen_fd, 0) < 0) { fprintf(stderr, "Error calling listen()\n"); exit(5); } #ifndef ACTUALLY_ACCEPT_CONNECTIONS_INSTEAD_OF_JUST_SLEEPING /* Do nothing--let incoming connections pile up */ printf("%0.3f Listening but not calling accept()\n", get_time()); sleep(1000); #else /* Consume connections */ while (1) { printf("%0.3f Calling accept()\n", get_time()); int client_fd = accept(listen_fd, NULL, NULL); if (client_fd < 0) { fprintf(stderr, "Error calling accept()\n"); exit(6); } /* * This 10 microsecond sleep ensures that the client (usually) * closes their side of the connection before we do. It makes * the 3 and 9 second timeouts much more common. */ usleep(10); printf("%0.3f Calling close()\n", get_time()); if (close(client_fd) != 0) { fprintf(stderr, "Error calling close(): %s\n", strerror(errno)); exit(7); } } #endif /* ACTUALLY_ACCEPT_CONNECTIONS_INSTEAD_OF_JUST_SLEEPING */ return 0; } Example client code: #include <arpa/inet.h> #include <ctype.h> #include <errno.h> #include <fcntl.h> #include <netinet/in.h> #include <netinet/tcp.h> #include <signal.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/socket.h> #include <sys/time.h> #include <sys/types.h> #include <sys/un.h> #include <sys/wait.h> #include <unistd.h> static double get_time() { struct timeval t; gettimeofday(&t, NULL); return t.tv_sec + (t.tv_usec / 1000000.0); } int main(int argc, char *argv[]) { const char *server; unsigned short port; char *endptr; struct sockaddr_in sockaddr; unsigned int count; /* Read parameters */ if (argc != 3) { fprintf(stderr, "usage: %s <ip> <port>\n", argv[0]); exit(1); } server = argv[1]; port = strtol(argv[2], &endptr, 0); if (*endptr) { fprintf(stderr, "Invalid port number\n"); exit(2); } /* Bind socket */ memset(&sockaddr, 0, sizeof(sockaddr)); sockaddr.sin_family = AF_INET; sockaddr.sin_addr.s_addr = inet_addr(server); sockaddr.sin_port = htons(port); for (count = 1; count <= 100; count++) { double time_begin; double time_end; int fd; /* Create socket */ printf("%0.3f Calling socket()\n", get_time()); fd = socket(PF_INET, SOCK_STREAM, IPPROTO_IP); if (fd < 0) { fprintf(stderr, "Error calling socket(): %s\n", strerror(errno)); exit(3); } /* Record begin time */ time_begin = get_time(); #if 0 /* * Set SO_LINGER on the client socket. This causes close() to * block until all TCP traffic for the socket has been flushed * and client and server have both acknowledged that the socket * is closed. * * Enabling this code eliminates the 3 and 9 second delays from * connect() and instead causes close() to hang until the server * accepts connections. */ struct linger linger; linger.l_onoff = 1; linger.l_linger = 100000; /* Wait effectively forever */ if (setsockopt(fd, SOL_SOCKET, SO_LINGER, &linger, sizeof(linger)) != 0) { fprintf(stderr, "Error calling socket(): %s\n", strerror(errno)); exit(3); } #endif /* Connect */ printf("%0.3f Calling connect()\n", get_time()); if (connect(fd, (struct sockaddr *)&sockaddr, sizeof(sockaddr)) < 0) { fprintf(stderr, "Error connecting to %s: %s\n", server, strerror(errno)); continue; } /* Record end time */ time_end = get_time(); if ((time_end - time_begin) > 1) { printf("Connection #%u took %.6lf seconds\n", count, time_end - time_begin); } /* Close socket */ printf("%0.3f Calling close()\n", get_time()); if (close(fd) != 0) { fprintf(stderr, "Error calling close(): %s\n", strerror(errno)); exit(4); } } return 0; } -- To unsubscribe from this list: send the line "unsubscribe linux-net" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html