On Sat, 16 Aug 2008, Dâniel Fraga wrote:

> On Sat, 16 Aug 2008 22:18:50 +0300 (EEST)
> "Ilpo Järvinen" <ilpo.jarvinen@xxxxxxxxxxx> wrote:
>
> > I'll look through 2.6.24..25 history once I have some time to see if
> > there are some clues about the cause. I'm also having a problem in
> > figuring out why the frto patch you tested would solve this issue
> > (unless there are two issues in the picture).
>
> Ok, surely some patch between .24 and .25 caused this. Or it's
> some bug that only "appeared" in .25 :)
>
> In fact, the frto patch helped, but did not prevent the problem.
> I mean, it seems that with the frto patch, the problem doesn't happen
> frequently. And if I disable frto, the problem doesn't occur either.
>
> But, maybe, we could be talking about another bug, completely
> unrelated to frto... I don't know. I'm just guessing ;). Anyway, we
> talk about stalled connections ;)
>
> What I know is:
>
> 1) what you wrote is right: 2.6.24 is fine, 2.6.25 and 2.6.26 not
>
> 2) nmap -sS <server> seems to reset the connection (it's my workaround
> until now ;). Maybe the ping probe helps in some way? I don't know.

Perhaps, though it's not at all clear how it could do that...

> I want to help you as much as I can. So, ask anything you need.

I went through the TCP related and inet_connection_sock related things;
nothing obvious I could notice in there...

Do you have net namespaces enabled (CONFIG_NET_NS in .config)?

Any netfilter (iptables) rules on the server which could cause those
packets to not reach the TCP layer?

The MIBs might give some clue as to why those segments didn't get
accepted. The most interesting ones are PAWSEstab, TCPAbortOnSyn and
InErrs. One can use /bin/cut to read those from the one-line files if one
wants to (however, I attached a script which transposes them to get them
somewhat human-readable).
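For readers without the attachment, the transposing idea can be sketched as below. This is not the attached readmibs.sh, only a minimal illustration in Python. It assumes the counters are read from /proc/net/snmp and /proc/net/netstat, where each group is a header line of names followed by a line of values with the same prefix. The function names transpose_mibs and read_mibs are made up for this sketch.

```python
from pathlib import Path

# Counter names mentioned in the mail; the rest of this sketch is an assumption.
INTERESTING = ("PAWSEstab", "TCPAbortOnSyn", "InErrs")


def transpose_mibs(text):
    """Turn header/value line pairs into a {"Prefix.Name": value} dict."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    counters = {}
    # Lines come in pairs: "Prefix: name1 name2 ..." then "Prefix: v1 v2 ...".
    for names, values in zip(lines[0::2], lines[1::2]):
        if names[0] != values[0]:
            raise ValueError(f"header/value mismatch: {names[0]} vs {values[0]}")
        prefix = names[0].rstrip(":")
        for name, value in zip(names[1:], values[1:]):
            counters[f"{prefix}.{name}"] = int(value)
    return counters


def read_mibs(paths=("/proc/net/snmp", "/proc/net/netstat")):
    """Read and merge the counter files, skipping any that do not exist."""
    counters = {}
    for path in paths:
        p = Path(path)
        if p.exists():
            counters.update(transpose_mibs(p.read_text()))
    return counters


if __name__ == "__main__":
    # Print one counter per line, restricted to the names listed above.
    for key, value in sorted(read_mibs().items()):
        if key.split(".", 1)[1] in INTERESTING:
            print(f"{key:30} {value}")
```

Running it twice, once before and once during a stall, and comparing the numbers would show which of the counters moved.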

Also, having the /proc/net/tcp output from the server while it was
stalling would be (have been) useful to reveal state info (but I should
have remembered to ask you to run it on both of them :-)).

Also, I wonder what that [|tcp] hides, e.g.,
"<nop,nop,timestamp 15980976 70381399,nop,nop,[|tcp]>" in tcpdump (and
that was for an ACK, which doesn't make too much sense to me there). It
occurs because the snaplen given to tcpdump is small enough to leave the
TCP header partial.

--
 i.

Attachment:
readmibs.sh
Description: Bourne shell script