Hi Leo, Thanks for the info... ----- "Leo" <neleo@xxxxxxx> wrote: > Dear Nathan, > > unfortunately nobody has answered to my first post, but at the same > time there was another thread about 3sec timeouts. Therefore we have > discussed the issue on this thread: > > http://marc.info/?t=120655182600018&r=1&w=2 > > Please read this thread for further details. So far we have a > workaround but no solution > (http://marc.info/?l=linux-net&m=121803213124143&w=2). > But AFAIK the current kernel versions are not affected anymore. > > Hope this helps! > > Kind regards, > Leo > > > Nathan Scott wrote: > > Hi Leo, > > > > I came across your mail related to a 3 second connect > > delay from early last year: > > http://marc.info/?l=linux-net&m=120214156112338&w=2 > > > > Did you ever get to the bottom of this? We're also now > > observing it in our environment (we're running RHEL5, so > > yet another, and different, kernel version to you). > > > > thanks! > > I've been digging into this a little too, so it was good to read of other peoples experiences and analysis of this issue so far. I have observed it as recently as fc11 (2.6.29.6-217, so its not clear whether it is fixed yet or not - I've not tried a current mainline kernel as yet though). I came across another workaround, and have some more information that might help the networking folks zero in on this some more, though I'm quickly bit out of my depth in some of the networking code. I've attached a slimmed-down and modified version of your test case (leo.c). With no arguments, this is your original test case and it hits the problem very quickly. With an argument, this now sets SO_LINGER, which seems to avoid the problem... $ gcc -g -Wall -o leo leo.c $ ./leo 127.0.0.1(25): 3.000366 seconds elapsed after 262 iterations $ ./leo 1 127.0.0.1(25): 1000 connections completed 127.0.0.1(25): 2000 connections completed 127.0.0.1(25): 3000 connections completed 127.0.0.1(25): 4000 connections completed 127.0.0.1(25): 5000 connections completed 127.0.0.1(25): 6000 connections completed 127.0.0.1(25): 7000 connections completed 127.0.0.1(25): 8000 connections completed 127.0.0.1(25): 9000 connections completed 127.0.0.1(25): 10000 connections completed 127.0.0.1(25): 11000 connections completed I've also attached a systemtap script which I've been using to attempt to find the source of the delay, and to follow the code in the kernel thats executing when the test case fails. Below is some of its output, from the tail of a failed run. It suggests to me that perhaps we have a race condition here, in the cleanup following a close(2) on the socket file descriptor. If we set SO_LINGER, then close() will wait "until all queued messages for the socket have been successfully sent or the linger timeout has been reached." The systemtap trace points toward inet_stream_connect() as the source of the 3sec delay - it calls inet_wait_for_connect() which AIUI loops waiting for a TCPF_SYN_SENT|TCPF_SYN_RECV state. I'm guessing that somehow the messages which would normally set those state are consumed in a different part of the code when the race occurs. It will need someone with stronger net-fu than me to figure that out, though. Here's the output from the attached stap script, anyway, with a few random thoughts interspersed in braces... $ sudo stap socket.stap ... sys_connect: elapsed 70us (started 1250728561153903us, ended 1250728561153973us): leo [thats the end of the previous, non-failing connect(2) call)] lock_sock started: 1250728561153977 ns lock_sock: elapsed 1us (started 1250728561153977us, ended 1250728561153978us): leo release_sock: elapsed 1us (started 1250728561153980us, ended 1250728561153981us): leo lock_sock started: 1250728561153985 ns lock_sock: elapsed 1us (started 1250728561153985us, ended 1250728561153986us): leo sk_stream_wait_close: elapsed 1us (started 1250728561153993us, ended 1250728561153994us): leo release_sock: elapsed 1us (started 1250728561153996us, ended 1250728561153997us): leo tcp_close: elapsed 15us (started 1250728561153984us, ended 1250728561153999us): leo inet_release: elapsed 18us (started 1250728561153983us, ended 1250728561154001us): leo sock_close: elapsed 27us (started 1250728561153976us, ended 1250728561154003us): leo lock_sock started: 1250728561154023 ns lock_sock: elapsed 2us (started 1250728561154023us, ended 1250728561154025us): leo tcp_connect: elapsed 14us (started 1250728561154032us, ended 1250728561154046us): leo tcp_v4_connect: elapsed 20us (started 1250728561154029us, ended 1250728561154049us): leo release_sock: elapsed 1us (started 1250728561154068us, ended 1250728561154069us): leo schedule_timeout CALLED BY unknown, 9223372036854775807 timeo [bzzt. note the start timestamp on schedule() below - 125072856_4_... compared to the end of tcp_v4_connect - 125072856_1_... -- 3 seconds. And release_sock, called just before schedule_timout in the call to inet_wait_for_connect has the ..._1_... timestamp - so, pretty sure we lose the 3 seconds right there] schedule: elapsed 6us (started 1250728564153581us, ended 1250728564153587us): leo schedule_timeout CALLED BY inet_stream_connect, 9223372036854775807 rval Returning from: 0xffffffff800637c9 : schedule_timeout+0x0/0xad Returning to : 0xffffffff800591e6 : inet_stream_connect+0x132/0x236 0xffffffff8009df5d : autoremove_wake_function+0x0/0x2e 0xffffffff8007e833 : kretprobe_trampoline_holder+0x0/0x2 0xffffffff800b4153 : audit_syscall_entry+0x16e/0x1a1 0xffffffff8007e833 : kretprobe_trampoline_holder+0x0/0x2 schedule_timeout: elapsed 69us (started 1250728564153545us, ended 1250728564153614us): leo lock_sock started: 1250728564153619 ns lock_sock: elapsed 2us (started 1250728564153620us, ended 1250728564153622us): leo release_sock: elapsed 0us (started 1250728564153629us, ended 1250728564153629us): leo inet_stream_connect: elapsed 2999617us (started 1250728561154023us, ended 1250728564153640us): leo [note elapsed time above - 3sec somewhere inside inet_stream_connect, but in the second half of that routine, as it calls tcp_v4_connect and that shows an early-on timestamp at its end] Returning from: 0xffffffff800590b4 : inet_stream_connect+0x0/0x236 Returning to : 0xffffffff8020d9db : sys_connect+0x7e/0xae 0xffffffff800b4153 : audit_syscall_entry+0x16e/0x1a1 0xffffffff8007e833 : kretprobe_trampoline_holder+0x0/0x2 [we also don't see the final elapsed time for sys_connect now, as my systemtap script calls exit as soon as any function takes more than a second to complete]. This trace data, with Gabriels observation ( in this message: http://marc.info/?l=linux-net&m=120716473303355&w=2 ) that "The client network device "sees" the SYN/ACK packet, but it is not processed by the network stack", and the LINGER workaround, all make me suspect a race somewhere thats consuming the extra SYN somewhere that it perhaps shouldn't be, and the following call to sys_connect ends up being a victim. Anyway, I'm a bit outta my depth at this point and hope this can help other, more knowledgeable folk looking into this problem! cheers. -- Nathan
#include <stdio.h> #include <stdlib.h> #include <ctype.h> #include <unistd.h> #include <fcntl.h> #include <signal.h> #include <sys/time.h> #include <sys/types.h> #include <sys/socket.h> #include <sys/un.h> #include <sys/wait.h> #include <netinet/tcp.h> #include <netinet/in.h> #include <arpa/inet.h> void linger(int sock, int ltime) { struct linger linger; linger.l_onoff = 1; linger.l_linger = ltime; setsockopt(sock, SOL_SOCKET, SO_LINGER, (char *)&linger, sizeof(linger)); } int main(int argc, char **argv) { /* edit Server and Port */ char *server = "127.0.0.1"; int port = 25; int sock; struct sockaddr_in sock_addr; int count = 0; int lingertime = argc > 1 ? atoi(argv[1]) : 0; struct timeval tim; while(1) { gettimeofday(&tim, NULL); double t1 = tim.tv_sec + (tim.tv_usec/1000000.0); if ((sock = socket(PF_INET, SOCK_STREAM, IPPROTO_IP)) < 0) { fprintf(stderr, "ERROR: cannot create socket\n"); exit(1); } memset(&sock_addr, 0, sizeof (sock_addr)); sock_addr.sin_family = AF_INET; sock_addr.sin_addr.s_addr = inet_addr(server); sock_addr.sin_port = htons(port); if (connect(sock, (struct sockaddr *)&sock_addr, sizeof(sock_addr)) < 0) { fprintf(stderr, "ERROR: cannot connect to %s\n", server); exit(1); } if (argc > 1) linger(sock, lingertime); close(sock); gettimeofday(&tim, NULL); double t2=tim.tv_sec+(tim.tv_usec/1000000.0); if ((t2-t1) > 1) { printf("%s(%d): %.6lf seconds elapsed after %d iterations\n", server, port, t2-t1, count); exit(1); } count++; if (!(count % 1000)) printf("%s: %d connections completed\n", server, count); } return 0; }
Attachment:
socket.stap
Description: Binary data