Great work Nathan!  fwiw, I am able to replicate Nathan's results on our
production boxen, running FC5 2.6.17 kernels (2.6.17-1.2171_FC5_PDcentossmp).

I don't know if this is helpful, but I ran leo.c without arguments multiple
times on our entire server farm (100 times on 205 machines) and saw an
interesting pattern.  85% of the time, the script exited after exactly 14
iterations.  The remaining 15% of runs took anywhere from 13 to 40
iterations, but none over 40.  Again, this may mean nothing, but it might
also help someone track down the source of the issue.  When leo.c was run
with an argument it generally ran until manually terminated.

Sample output:

[web@ws500_multiply:multiply ~]$ /pd/global/multiply/scripts/leo
127.0.0.1(25): 3.000677 seconds elapsed after 14 iterations

[web@ws500_multiply:multiply ~]$ /pd/global/multiply/scripts/leo 1
127.0.0.1: 1000 connections completed
127.0.0.1: 2000 connections completed
127.0.0.1: 3000 connections completed
127.0.0.1: 4000 connections completed
127.0.0.1: 5000 connections completed
127.0.0.1: 6000 connections completed
127.0.0.1: 7000 connections completed
127.0.0.1: 8000 connections completed
127.0.0.1: 9000 connections completed
127.0.0.1: 10000 connections completed
127.0.0.1: 11000 connections completed
.
.
.

On Wed, Aug 19, 2009 at 9:25 PM, Nathan Scott <nscott@xxxxxxxxxx> wrote:
> Hi Leo,
>
> Thanks for the info...
>
> ----- "Leo" <neleo@xxxxxxx> wrote:
>
>> Dear Nathan,
>>
>> Unfortunately nobody has answered my first post, but at the same
>> time there was another thread about 3sec timeouts, so we have
>> discussed the issue on that thread:
>>
>> http://marc.info/?t=120655182600018&r=1&w=2
>>
>> Please read that thread for further details.  So far we have a
>> workaround but no solution
>> (http://marc.info/?l=linux-net&m=121803213124143&w=2).
>> But AFAIK the current kernel versions are not affected anymore.
>>
>> Hope this helps!
>>
>> Kind regards,
>> Leo
>>
>>
>> Nathan Scott wrote:
>> > Hi Leo,
>> >
>> > I came across your mail related to a 3 second connect
>> > delay from early last year:
>> > http://marc.info/?l=linux-net&m=120214156112338&w=2
>> >
>> > Did you ever get to the bottom of this?  We're also now
>> > observing it in our environment (we're running RHEL5, so
>> > yet another kernel version, different from yours).
>> >
>> > thanks!
>> >
>
> I've been digging into this a little too, so it was good to read
> of other people's experiences and analysis of this issue so far.
> I have observed it as recently as fc11 (2.6.29.6-217), so it's not
> clear whether it is fixed yet or not - I've not tried a current
> mainline kernel as yet, though.
>
> I came across another workaround, and have some more information
> that might help the networking folks zero in on this some more,
> though I'm quickly getting a bit out of my depth in some of the
> networking code.
>
> I've attached a slimmed-down and modified version of your test
> case (leo.c).  With no arguments, this is your original test case
> and it hits the problem very quickly.  With an argument, this now
> sets SO_LINGER, which seems to avoid the problem...
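>
> For anyone reading along without the attachment, the no-argument path
> boils down to a loop along these lines.  This is a cut-down paraphrase
> rather than the attached leo.c itself: the 1 second reporting threshold
> is my own choice for illustration, and it assumes something (a local
> MTA, say) is listening on 127.0.0.1:25 as in the runs below.  The
> SO_LINGER variant is shown further down.
>
> /* Not the attached leo.c - a minimal paraphrase of the failing loop. */
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/time.h>
> #include <sys/socket.h>
> #include <netinet/in.h>
> #include <arpa/inet.h>
>
> int main(void)
> {
>     struct sockaddr_in sin;
>     unsigned long count = 0;
>
>     memset(&sin, 0, sizeof(sin));
>     sin.sin_family = AF_INET;
>     sin.sin_port = htons(25);              /* local SMTP, as in the sample runs */
>     sin.sin_addr.s_addr = inet_addr("127.0.0.1");
>
>     for (;;) {
>         struct timeval start, end;
>         double elapsed;
>         int fd = socket(AF_INET, SOCK_STREAM, 0);
>
>         if (fd < 0) {
>             perror("socket");
>             exit(1);
>         }
>         gettimeofday(&start, NULL);
>         if (connect(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
>             perror("connect");
>             exit(1);
>         }
>         gettimeofday(&end, NULL);
>         close(fd);
>         count++;
>
>         elapsed = (end.tv_sec - start.tv_sec) +
>                   (end.tv_usec - start.tv_usec) / 1e6;
>         if (elapsed > 1.0) {               /* connect() stalled - report and stop */
>             printf("127.0.0.1(25): %f seconds elapsed after %lu iterations\n",
>                    elapsed, count);
>             exit(0);
>         }
>         if (count % 1000 == 0)
>             printf("127.0.0.1(25): %lu connections completed\n", count);
>     }
> }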
>
> $ gcc -g -Wall -o leo leo.c
> $ ./leo
> 127.0.0.1(25): 3.000366 seconds elapsed after 262 iterations
> $ ./leo 1
> 127.0.0.1(25): 1000 connections completed
> 127.0.0.1(25): 2000 connections completed
> 127.0.0.1(25): 3000 connections completed
> 127.0.0.1(25): 4000 connections completed
> 127.0.0.1(25): 5000 connections completed
> 127.0.0.1(25): 6000 connections completed
> 127.0.0.1(25): 7000 connections completed
> 127.0.0.1(25): 8000 connections completed
> 127.0.0.1(25): 9000 connections completed
> 127.0.0.1(25): 10000 connections completed
> 127.0.0.1(25): 11000 connections completed
>
> I've also attached a systemtap script which I've been using to
> attempt to find the source of the delay, and to follow the code
> in the kernel that's executing when the test case fails.
>
> Below is some of its output, from the tail of a failed run.  It
> suggests to me that perhaps we have a race condition here, in
> the cleanup following a close(2) on the socket file descriptor.
> If we set SO_LINGER, then close() will wait "until all queued
> messages for the socket have been successfully sent or the
> linger timeout has been reached."
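>
> The with-argument variant presumably amounts to applying something
> like this small helper to each socket - a sketch of the idea rather
> than the exact code in the attachment, and the 10 second l_linger
> value is an arbitrary choice of mine:
>
> #include <sys/socket.h>
>
> /*
>  * Request a lingering close: with l_onoff set, close() blocks until
>  * queued data has been sent, or until l_linger seconds have passed,
>  * instead of returning immediately.
>  */
> static int set_linger(int fd)
> {
>     struct linger lin;
>
>     lin.l_onoff  = 1;
>     lin.l_linger = 10;
>     return setsockopt(fd, SOL_SOCKET, SO_LINGER, &lin, sizeof(lin));
> }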
>
> The systemtap trace points toward inet_stream_connect() as the source
> of the 3sec delay - it calls inet_wait_for_connect(), which AIUI loops
> while the socket is in the TCPF_SYN_SENT|TCPF_SYN_RECV states.  I'm
> guessing that somehow the messages which would normally move it out of
> those states are consumed in a different part of the code when the
> race occurs.  It will need someone with stronger net-fu than me to
> figure that out, though.
>
> Here's the output from the attached stap script, anyway, with a
> few random thoughts interspersed in brackets...
>
> $ sudo stap socket.stap
> ...
> sys_connect: elapsed 70us (started 1250728561153903us, ended 1250728561153973us): leo
>
> [that's the end of the previous, non-failing connect(2) call]
>
> lock_sock started: 1250728561153977 ns
> lock_sock: elapsed 1us (started 1250728561153977us, ended 1250728561153978us): leo
> release_sock: elapsed 1us (started 1250728561153980us, ended 1250728561153981us): leo
> lock_sock started: 1250728561153985 ns
> lock_sock: elapsed 1us (started 1250728561153985us, ended 1250728561153986us): leo
> sk_stream_wait_close: elapsed 1us (started 1250728561153993us, ended 1250728561153994us): leo
> release_sock: elapsed 1us (started 1250728561153996us, ended 1250728561153997us): leo
> tcp_close: elapsed 15us (started 1250728561153984us, ended 1250728561153999us): leo
> inet_release: elapsed 18us (started 1250728561153983us, ended 1250728561154001us): leo
> sock_close: elapsed 27us (started 1250728561153976us, ended 1250728561154003us): leo
> lock_sock started: 1250728561154023 ns
> lock_sock: elapsed 2us (started 1250728561154023us, ended 1250728561154025us): leo
> tcp_connect: elapsed 14us (started 1250728561154032us, ended 1250728561154046us): leo
> tcp_v4_connect: elapsed 20us (started 1250728561154029us, ended 1250728561154049us): leo
> release_sock: elapsed 1us (started 1250728561154068us, ended 1250728561154069us): leo
> schedule_timeout CALLED BY unknown, 9223372036854775807 timeo
>
> [bzzt. note the start timestamp on schedule() below - 125072856_4_...
> compared to the end of tcp_v4_connect - 125072856_1_... -- 3 seconds.
> And release_sock, called just before schedule_timeout in the call to
> inet_wait_for_connect, has the ..._1_... timestamp - so, pretty sure
> we lose the 3 seconds right there]
>
> schedule: elapsed 6us (started 1250728564153581us, ended 1250728564153587us): leo
> schedule_timeout CALLED BY inet_stream_connect, 9223372036854775807 rval
> Returning from: 0xffffffff800637c9 : schedule_timeout+0x0/0xad
> Returning to  : 0xffffffff800591e6 : inet_stream_connect+0x132/0x236
>  0xffffffff8009df5d : autoremove_wake_function+0x0/0x2e
>  0xffffffff8007e833 : kretprobe_trampoline_holder+0x0/0x2
>  0xffffffff800b4153 : audit_syscall_entry+0x16e/0x1a1
>  0xffffffff8007e833 : kretprobe_trampoline_holder+0x0/0x2
> schedule_timeout: elapsed 69us (started 1250728564153545us, ended 1250728564153614us): leo
> lock_sock started: 1250728564153619 ns
> lock_sock: elapsed 2us (started 1250728564153620us, ended 1250728564153622us): leo
> release_sock: elapsed 0us (started 1250728564153629us, ended 1250728564153629us): leo
> inet_stream_connect: elapsed 2999617us (started 1250728561154023us, ended 1250728564153640us): leo
>
> [note the elapsed time above - 3sec somewhere inside inet_stream_connect,
> but in the second half of that routine, as it calls tcp_v4_connect and
> that shows an early-on timestamp at its end]
>
> Returning from: 0xffffffff800590b4 : inet_stream_connect+0x0/0x236
> Returning to  : 0xffffffff8020d9db : sys_connect+0x7e/0xae
>  0xffffffff800b4153 : audit_syscall_entry+0x16e/0x1a1
>  0xffffffff8007e833 : kretprobe_trampoline_holder+0x0/0x2
>
> [we also don't see the final elapsed time for sys_connect now, as
> my systemtap script calls exit as soon as any function takes more
> than a second to complete]
>
> This trace data, together with Gabriel's observation (in this message:
> http://marc.info/?l=linux-net&m=120716473303355&w=2) that "The
> client network device "sees" the SYN/ACK packet, but it is not
> processed by the network stack", and the LINGER workaround, all
> make me suspect a race that's consuming the extra SYN somewhere
> it perhaps shouldn't be, with the following call to sys_connect
> ending up as the victim.
>
> Anyway, I'm a bit outta my depth at this point and hope this can
> help other, more knowledgeable folk looking into this problem!
>
> cheers.
>
> --
> Nathan

--
Brett Paden
paden@xxxxxxxxxxxx
http://foonasty.multiply.com