(Sorry for the rather long email. Here's an "executive summary": the Windows firewall doesn't reply with RST to TCP retransmissions on a client-closed connection, which causes Apache workers to get stuck for 5 minutes.)

Hello list,

Lately I've been trying to track down spurious apparent freezes in an application running on Apache + PHP. In short, it seems that Apache (or the kernel?) in some cases fails to detect when a client closes a connection mid-download, and leaves the worker stuck for several minutes trying to write the full HTTP response. The problem is amplified by the fact that PHP keeps its session file flock()'ed while this happens, which means any further requests from the client never get answered (at least until the stuck worker times out). Please note that although this post focuses on PHP, I'm pretty sure the problem is not specific to that scripting language.

Interestingly, this occurs much more frequently if the client is running the Windows XP SP2 built-in firewall. I'll get back to that shortly.

Here is a small test case:

index.php contains:

    <? session_start(); ?>
    <html>
    <body>
    Test case for hang
    <?
    for ($i=0; $i<100; $i++) {
        echo $i.'<img src="noimg.php?i='.$i.'"><br>';
    }
    ?>
    </body>
    </html>

noimg.php contains:

    <? session_start(); ?>
    <html>
    <body>
    <?
    for ($i=0;$i<1000;$i++) {
        echo "$i: asdf asdf asdf asdf asdfasdf asdf asdf asdf asdf asdf asdf asdf asdf\n";
    }
    ?>
    </body>
    </html>

(I know it's silly to return a text/html page to be loaded in an <img src>, but that's necessary to make the client abort the connection mid-download. This actually happened in real life due to an erroneous <img src=""> tag which caused the browser to load the *current URL* as an image.)

My test setup consists of:

* Server: Apache HTTPd 2.2.3 compiled from source, running on Debian GNU/Linux stable with kernel 2.4.33.3
* Client: Firefox 2.0 on Windows XP SP2, with the Windows firewall enabled.

Server and client are placed on the same LAN.

What happens when pointing Firefox at the index.php given above is that it starts to load the various <img> tags, but aborts the connection for each image mid-download, probably because it detects the MIME type text/html. The problem is that after a couple of requests, Apache fails to detect that the client has closed the connection. Firefox then tries to load the next image, but the previous "image" (PHP script) is still running and keeping the session locked, so any further requests from the client just hang.

I did a Wireshark capture on the client while executing the test case, and here is an excerpt (I could probably sanitize out passwords etc. and provide a full .pcap file, should that be necessary):

(10.0.0.43 is the server, 10.0.0.138 is the client, '>' marks the most interesting packets)

 596   9.729025 10.0.0.138 -> 10.0.0.43  TCP  2623 80 2623 > 80 [SYN] Seq=0 Len=0 MSS=1460
 600   9.751507 10.0.0.43 -> 10.0.0.138  TCP  80 2623 80 > 2623 [SYN, ACK] Seq=0 Ack=1 Win=5840 Len=0 MSS=1460
 601   9.751546 10.0.0.138 -> 10.0.0.43  TCP  2623 80 2623 > 80 [ACK] Seq=1 Ack=1 Win=64512 Len=0
 602   9.760570 10.0.0.138 -> 10.0.0.43  HTTP 2623 80 GET /noimg.php?i=45 HTTP/1.1
 603   9.762774 10.0.0.43 -> 10.0.0.138  TCP  80 2623 80 > 2623 [ACK] Seq=1 Ack=464 Win=6432 Len=0
 604   9.803347 10.0.0.43 -> 10.0.0.138  TCP  80 2623 [TCP segment of a reassembled PDU]
 605   9.810978 10.0.0.43 -> 10.0.0.138  TCP  80 2623 [TCP segment of a reassembled PDU]
 606   9.811029 10.0.0.138 -> 10.0.0.43  TCP  2623 80 2623 > 80 [ACK] Seq=464 Ack=2921 Win=64512 Len=0
>607   9.813999 10.0.0.138 -> 10.0.0.43  TCP  2623 80 2623 > 80 [FIN, ACK] Seq=464 Ack=2921 Win=64512 Len=0
 608   9.814405 10.0.0.138 -> 10.0.0.43  TCP  2624 80 2624 > 80 [SYN] Seq=0 Len=0 MSS=1460
 609   9.819676 10.0.0.43 -> 10.0.0.138  TCP  80 2623 [TCP segment of a reassembled PDU]
>610   9.819725 10.0.0.138 -> 10.0.0.43  TCP  2623 80 2623 > 80 [RST, ACK] Seq=465 Ack=4381 Win=0 Len=0
 611   9.826157 10.0.0.43 -> 10.0.0.138  TCP  80 2623 [TCP segment of a reassembled PDU]
 612   9.859980 10.0.0.43 -> 10.0.0.138  TCP  80 2623 [TCP Previous segment lost] 80 > 2623 [ACK] Seq=7301 Ack=465 Win=6432 Len=0
>613  10.072413 10.0.0.43 -> 10.0.0.138  TCP  80 2623 [TCP Retransmission] [TCP segment of a reassembled PDU]
>614  10.568147 10.0.0.43 -> 10.0.0.138  TCP  80 2623 [TCP Retransmission] [TCP segment of a reassembled PDU]
>620  11.558401 10.0.0.43 -> 10.0.0.138  TCP  80 2623 [TCP Retransmission] [TCP segment of a reassembled PDU]
 623  12.785586 10.0.0.138 -> 10.0.0.43  TCP  2624 80 2624 > 80 [SYN] Seq=0 Len=0 MSS=1460
 624  12.786740 10.0.0.43 -> 10.0.0.138  TCP  80 2624 80 > 2624 [SYN, ACK] Seq=0 Ack=1 Win=5840 Len=0 MSS=1460
 625  12.786768 10.0.0.138 -> 10.0.0.43  TCP  2624 80 2624 > 80 [ACK] Seq=1 Ack=1 Win=64512 Len=0
 626  12.789319 10.0.0.138 -> 10.0.0.43  HTTP 2624 80 GET /noimg.php?i=46 HTTP/1.1
 627  12.790579 10.0.0.43 -> 10.0.0.138  TCP  80 2624 80 > 2624 [ACK] Seq=1 Ack=464 Win=6432 Len=0
>628  13.569077 10.0.0.43 -> 10.0.0.138  TCP  80 2623 [TCP Retransmission] [TCP segment of a reassembled PDU]
>637  17.570187 10.0.0.43 -> 10.0.0.138  TCP  80 2623 [TCP Retransmission] [TCP segment of a reassembled PDU]
>638  25.575101 10.0.0.43 -> 10.0.0.138  TCP  80 2623 [TCP Retransmission] [TCP segment of a reassembled PDU]
>639  41.563376 10.0.0.43 -> 10.0.0.138  TCP  80 2623 [TCP Retransmission] [TCP segment of a reassembled PDU]
>650 137.533500 10.0.0.43 -> 10.0.0.138  TCP  80 2623 [TCP Retransmission] [TCP segment of a reassembled PDU]
>651 257.507283 10.0.0.43 -> 10.0.0.138  TCP  80 2623 [TCP Retransmission] [TCP segment of a reassembled PDU]
>652 377.499848 10.0.0.43 -> 10.0.0.138  TCP  80 2623 [TCP Retransmission] [TCP segment of a reassembled PDU]

(then things "unlock" and proceed as normal for a while)

The interesting thing to note here is that the client sends a FIN,ACK, then an RST,ACK, and then stays completely silent on port 2623.
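
To make the retransmission pattern easier to see, here is a small sketch that computes the gaps between the packets marked [TCP Retransmission] on port 2623. The timestamps are copied from the excerpt above; the names `retransmit_times` and `gaps` are made up for this illustration.

```python
# Timestamps (seconds) of the packets marked [TCP Retransmission] on
# port 2623, copied from the capture excerpt.
retransmit_times = [
    10.072413, 10.568147, 11.558401, 13.569077, 17.570187,
    25.575101, 41.563376, 137.533500, 257.507283, 377.499848,
]


def gaps(times):
    """Return the delay between consecutive timestamps, rounded to 0.1 s."""
    return [round(later - earlier, 1) for earlier, later in zip(times, times[1:])]


intervals = gaps(retransmit_times)
print(intervals)
# prints [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 96.0, 120.0, 120.0]
```

The delay doubles on each attempt (0.5 s up to 16 s) and then levels off at 120 s, which looks like ordinary exponential backoff with a 120 s ceiling on the server side. The single 96 s gap would fit a 32 s plus a 64 s step, so I am assuming one retransmission around 73.5 s is simply missing from the excerpt; that is a guess, not something the capture shows.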

netstat -n on the client shows:

    TCP    10.0.0.138:2624    10.0.0.43:80    ESTABLISHED

netstat -np on the server shows:

    tcp    0      0 10.0.0.43:80    10.0.0.138:2624    ESTABLISHED 6080/httpd
    tcp    1  10220 10.0.0.43:80    10.0.0.138:2623    CLOSE_WAIT  6075/httpd

In other words, the client has completely "forgotten" the port-2623 connection, but the server still knows about it. Attaching to pid 6075 with gdb and getting a backtrace shows:

#0  0x401fba18 in poll () from /lib/libc.so.6
#1  0x40093c78 in apr_wait_for_io_or_timeout (f=0x0, s=0x817f7c0, for_read=0) at support/unix/waitio.c:51
#2  0x4008efef in apr_socket_sendv (sock=0x817f7c0, vec=0xbfffbbf8, nvec=3, len=0xbfffbab8) at network_io/unix/sendrecv.c:208
#3  0x08073642 in writev_it_all (s=0x817f7c0, vec=0xbfffbbf0, nvec=4, len=8074, nbytes=0xbfffbb48) at core_filters.c:321
#4  0x08073fea in ap_core_output_filter (f=0x817fdd0, b=0x8185dd8) at core_filters.c:868
#5  0x0807ef11 in ap_pass_brigade (next=0x7531, bb=0x1) at util_filter.c:526
#6  0x0808f91a in ap_http_chunk_filter (f=0x8185f88, b=0x8185dd8) at chunk_filter.c:187
#7  0x0807ef11 in ap_pass_brigade (next=0x7531, bb=0x1) at util_filter.c:526
#8  0x0807ef11 in ap_pass_brigade (next=0x7531, bb=0x1) at util_filter.c:526
#9  0x080693ae in ap_content_length_filter (f=0x81927a8, b=0x8185dd8) at protocol.c:1338
#10 0x0807ef11 in ap_pass_brigade (next=0x7531, bb=0x1) at util_filter.c:526
#11 0x40048294 in apr_brigade_write (b=0x8185dd8, flush=0x807f050 <ap_filter_flush>, ctx=0xfffffffc, str=0x410e38a8 "218: asdf asdf asdf asdf asdfasdf asdf asdf asdf asdf asdf asdf asdf asdf\n", nbyte=74) at buckets/apr_brigade.c:400
#12 0x080697cd in buffer_output (r=0x8191a20, str=0x410e38a8 "218: asdf asdf asdf asdf asdfasdf asdf asdf asdf asdf asdf asdf asdf asdf\n", len=74) at protocol.c:1455
#13 0x080698db in ap_rwrite (buf=0xfffffffc, nbyte=74, r=0x7531) at protocol.c:1490
#14 0x40635137 in php_apache_sapi_ub_write (str=0xfffffffc <Address 0xfffffffc out of bounds>, str_length=74) at /devel2/x2www/src/php-5.2.0/sapi/apache2handler/sapi_apache2.c:78

(I won't bore the list with the output of "bt full", but you can get that at http://corehacker.com/~frode/apache-user/pollhang-bt-full.txt)

So, Apache is stuck in a poll(), apparently waiting for the client to suck down whatever Apache wants to write, but the client is long gone and we have to wait for the poll() to time out completely before the worker is freed.

Interestingly enough, if the Windows firewall is disabled on the client, there is no such long hang, because the client sends an RST packet for each "[TCP Retransmission]" packet, so the socket closes down almost immediately on the server as well.

I've reproduced exactly the same effect when running httpd 2.2.2 on FreeBSD 6.1, and also with httpd 1.3.34 on the same Linux box (although 1.3.34 seemed to detect the closed client socket quicker). Strangely enough, I failed to reproduce the effect when running the Windows XP SP2 + Firefox + firewall client setup inside VMware.

Does anyone have any tips on how to mitigate this problem (besides the obvious fix of "don't return text/html when the client sends Accept: image/png")? I don't think "disable the client firewall" is a realistic answer for a public-facing web site. In any case, I've gotten reports that disabling the firewall greatly improves things, but the occasional hang still occurs.

Also, isn't this a weakness that makes it fairly easy to create a Denial of Service situation by "eating up" all workers with little effort?
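
P.S. For anyone who wants to see the blocked write from the backtrace in miniature, here is a rough sketch. It is not Apache or APR code: a local socket pair stands in for the server/client connection, the peer simply never reads, and the helper names `fill_send_buffer` and `writable_within` are made up for the illustration. It only shows the general situation of a writer waiting for a socket that never becomes writable.

```python
import select
import socket


def fill_send_buffer(sock):
    """Write to a non-blocking socket until the kernel refuses more data."""
    sent = 0
    chunk = b"x" * 4096
    try:
        while True:
            sent += sock.send(chunk)
    except BlockingIOError:
        # The send buffer is full because the peer is not reading.
        return sent


def writable_within(sock, timeout_s):
    """Return True if sock becomes writable within timeout_s seconds."""
    _, writable, _ = select.select([], [sock], [], timeout_s)
    return bool(writable)


# Hypothetical stand-in for the connection: the "client" end never reads.
server_side, client_side = socket.socketpair()
server_side.setblocking(False)

queued = fill_send_buffer(server_side)
stuck = not writable_within(server_side, 0.2)
print(f"queued {queued} bytes, writer stuck: {stuck}")

server_side.close()
client_side.close()
```

The 0.2 s wait here plays the role of the much longer timeout the worker sits through in poll(); until either the peer drains data, the connection is reset, or the timeout expires, the writer has nothing to do but wait.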