Re: question about 3sec timeouts with tcp

Gabriel Barazer <gabriel@xxxxxxxx> · Tue, 01 Apr 2008 22:13:43 +0200

On 04/01/2008 8:59:24 PM +0200, "H. Willstrand" <h.willstrand@xxxxxxxxx> wrote:
On Tue, Apr 1, 2008 at 8:43 PM, Gabriel Barazer <gabriel@xxxxxxxx> wrote:
On 04/01/2008 8:28:14 PM +0200, "H. Willstrand" <h.willstrand@xxxxxxxxx> wrote:
 > On Tue, Apr 1, 2008 at 7:59 PM, Gabriel Barazer <gabriel@xxxxxxxx> wrote:
 >> On 04/01/2008 7:17:31 PM +0200, Leo <neleo@xxxxxxx> wrote:
 >>  > H. Willstrand wrote:
 >>  >> On Tue, Apr 1, 2008 at 5:43 PM, Gabriel Barazer <gabriel@xxxxxxxx> wrote:
 >>  >>
 >>  >>> On 04/01/2008 4:43:20 PM +0200, Brett Paden <paden@xxxxxxxxxxxx> wrote:
 >>  >>>  >> If I'm right, Brett's problem lies in the test client (provided in
 >>  >>>  >> the first mail). This has probably to do with the number of ports
 >>  >>>  >> opened and closed during a short time period.
 >>  >>>  >
 >>  >>>  > My test client is designed to simulate the sort of load our
 >>  >>>  > production databases and web servers see.  We're talking on the
 >>  >>>  > order of 100-400 connections per second.  On an unloaded server the
 >>  >>>  > 3000ms delays occur right around 400 connections a second, but we
 >>  >>>  > have seen them at lower connection rates.  Are you suggesting that
 >>  >>>  > we could do something simple (like reap TIME_WAIT connections) to
 >>  >>>  > alleviate the problem?
 >>  >>>
 >>  >>>  Using tcp_tw_recycle / tcp_tw_reuse solves the problem neither on
 >>  >>>  the client nor on the server. I tested with and without these options
 >>  >>>  enabled, and with netfilter's connection tracking disabled, and
 >>  >>>  nothing removed this delay. If even the "lo" interface is affected,
 >>  >>>  the problem is definitely in the network stack and not in the device
 >>  >>>  drivers.
 >>  >>>
 >>  >>>  Here is a thread I started on LKML about this very same bug.
 >>  >>>  http://lkml.org/lkml/2008/3/14/353
 >>  >>>  There is a forum thread with french hosting providers talking about it.
 >>  >>>  (if some of you read french:
 >>  >>>  http://www.webmasterclub.fr/forum/topic,59486,0.html)
 >>  >>>
 >>  >>>  We are far from being alone!
 >>  >>>
 >>  > Welcome to the club, Gabriel!
 >>  >>>  Gabriel
 >>
 >>  How lucky I am!
 >>  I suspect there are many other people having this problem out there.
 >>  They just don't notice the delays, both because their infrastructures
 >>  are small and because this bug doesn't actually cause a connection
 >>  error, but "only" an unacceptable delay on moderately to heavily
 >>  loaded servers.
 >>
 >>
 >>  >> Ok, seems to be the same issue that Leo has (has nothing to do with
 >>  >> the Brett / Marlon issue, the only common denominator is the 3000ms).
 >>  >>
 >>  > But Gabriel is also talking about 3 second timeouts on the client as
 >>  > Brett and I did. I have read Gabriel's  description on the provided link
 >>  > and it seems to be exactly the same problem. I think Brett can confirm
 >>  > this ...
 >>  >> This issue is probably caused by the server delivering a miscalculated
 >>  >> SYN/ACK (the acked number is miscalculated, see my second mail).
 >>  >>
 >>  > When you look at my first tcpdump with two machines as server and client
 >>  > then you can see that there are no miscalculated SYN/ACK packets from
 >>  > the server (and therefore no RST packet from the client). All packets
 >>  > have the right number but the client never receives the SYN/ACK packet
 >>  > from the server. Only at the lo test there are RST packets and wrong
 >>  > packet numbers. But as I told you in my last email I think this is a
 >>  > different problem and not important for us. We should ignore the lo test
 >>  > and concentrate on the "real" problem of Brett, Gabriel and myself  (and
 >>  > even a lot of other people out there).
 >>
 >>  I confirm that there is no problem with the sequence numbers. Attached is
 >>  the pcap-compatible capture of the relevant packets (608 bytes, 6
 >>  packets total: 2 for the failed handshake, 3 for the successful one and
 >>  1 for the first mysql data packet). This capture has been filtered to
 >>  show only the relevant packets and done in promiscuous mode.
 >>
 >

I'm missing the tcpdump...
 Sorry, I forgot to include it when reformatting my e-mail. Here it is!

 Gabriel

The packets are OK.
Still, how did you produce this situation? Let me guess, you used one
client to mass produce connections to your mysql-server, right?

I used a real-world setup and generated artificial traffic on it: 
2 "client" servers and 1 mysql server. Connections were made the same 
way they are in a real situation: CGI processes opening mysql 
connections to the server. Note that there are 2 "client" servers, and 
when the 3-second delay occurs, it affects only one of them at a time. 
The other runs fine, and so does the mysql server, which keeps serving 
the "good" client.
I'm not sure what you mean by "mass produce" connections, but even 
with only 200 connections/sec to the mysql-server the delay occurred.

To detect and capture it on a production server, I run a monitoring 
script that requests a CGI script on the servers; the CGI script 
connects to the mysql server, closes the connection and exits. The 
monitoring script shows all requests whose execution time is greater 
than 2.5 seconds. I let it run every second, wait a few minutes, and 
when it starts displaying long requests, I start a tcpdump capture on 
the client server. This is the least intrusive way I found to detect 
these delays on production servers. What is really odd is that only 
connections to the mysql-server suffer from the delay; all other 
connections (other mysql servers, ssh, http) are unaffected. This could 
(pure speculation) indicate that the problem occurs for only one 
destination IP/port tuple at a time, i.e. the mysql server IP and port 
3306.
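
For anyone who wants to reproduce the detection step, here is a rough 
sketch of such a monitor in Python. It is not the script I actually 
run: it times only the bare TCP connect (no CGI, no mysql handshake), 
and the names time_connect, slow_samples and monitor are made up for 
this example. The 2.5 second threshold is the one mentioned above.

```python
import socket
import time

THRESHOLD_S = 2.5  # same cut-off as the monitoring script described above


def time_connect(host, port, timeout=10.0):
    """Return the seconds taken to open and close one TCP connection."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connect only; the handshake is what we are timing
    return time.monotonic() - start


def slow_samples(samples, threshold=THRESHOLD_S):
    """Keep only the (timestamp, elapsed) pairs above the threshold."""
    return [(ts, elapsed) for ts, elapsed in samples if elapsed > threshold]


def monitor(host, port, interval=1.0, rounds=60):
    """Probe once per interval and print every slow connection."""
    for _ in range(rounds):
        ts = time.time()
        try:
            elapsed = time_connect(host, port)
        except OSError as exc:
            print(f"{ts:.0f} connect failed: {exc}")
        else:
            for when, took in slow_samples([(ts, elapsed)]):
                print(f"{when:.0f} slow connect: {took:.3f}s")
        time.sleep(interval)
```

Calling monitor("db.example.internal", 3306) (a placeholder host name) 
on the client server would print a line whenever a connect takes longer 
than 2.5 seconds, which is the moment to start the tcpdump capture.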

Gabriel