[users@httpd] Responses truncated with Apache 2.0 over Gigabit Ethernet

Greg Ward <gward-apache@xxxxxxxxxx> · Thu, 3 Mar 2005 18:55:59 -0500

We have been observing strange problems with Apache 2.0 at one of our
client sites for many months now.  A bit of background: we sell
specialized server software that is currently running on hundreds of
machines at scores of customer sites.  Apache is used to serve up static
content as well as act as a front-end to Tomcat via mod_jk.  The problem
I'm about to describe has only been observed at one customer site, and
the most obvious unique property of this site is Gigabit Ethernet to
every server and client workstation.

The server in question is running Red Hat Linux 6.2 (kernel
2.2.14-5.0smp) with Apache 2.0.52.  Client workstations run various
flavours of Windows, but I have also observed the problem when running
my test client on other Linux machines.

Here's what happens: a small fraction of HTTP responses are truncated
before the entire response body has been sent.  (The fraction seems to
vary from 1/20 to 1/10,000 depending on who's doing the testing, what
hardware is involved, phase of moon, etc.)

To diagnose the problem, I wrote a Python program that implements this
algorithm:

  for i = 1 .. M:
    open connection to <host>
    for j = 1 .. N:
      send request for <uri>
      read Content-Length
      read response body
      ensure number of bytes read == Content-Length
    close connection

So if all goes well, this requests the same file M*N times.
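
For anyone who wants to try reproducing this, here is a minimal sketch of that loop in present-day Python 3 using http.client. It is not the httptest.py that produced the output in this message: the function name run_test, its parameters, and the wording of the failure strings are made up for illustration, and it assumes the server sends a Content-Length header (as it would for a static file).

```python
import http.client
import time


def run_test(host, port, uri, connections, requests_per_connection, timeout=30.0):
    """Fetch `uri` up to connections * requests_per_connection times.

    Returns a list of failure descriptions (empty if every body was complete).
    """
    failures = []
    for _ in range(connections):
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        try:
            for _ in range(requests_per_connection):
                start = time.monotonic()
                conn.request("GET", uri)
                resp = conn.getresponse()
                header = resp.getheader("Content-Length")
                expected = int(header) if header is not None else None
                try:
                    body = resp.read()
                except http.client.IncompleteRead as exc:
                    # The server closed the connection before sending
                    # Content-Length bytes; keep what did arrive.
                    body = exc.partial
                elapsed = time.monotonic() - start
                if expected is None or len(body) != expected:
                    failures.append(
                        "read %d bytes (expected %s) in %.1f sec"
                        % (len(body), expected, elapsed)
                    )
                    break  # connection is no longer usable; open a new one
        except (http.client.HTTPException, OSError) as exc:
            # Covers bad status lines, resets, and socket timeouts.
            failures.append("HTTP error: %r" % (exc,))
        finally:
            conn.close()
    return failures
```

Unlike my original client, this sketch abandons the connection right after a short read instead of sending another request on it, so each truncation is reported once.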

Failures always seem to come in pairs.  The first one looks like this:

  FAIL: read 110228 bytes (expected 131072) in 15.0 sec (7.2 kB/sec)

From packet-sniffing (tcpdump on the server, ethereal on the client),
I've determined that the sequence of events for this failure is:

  * client sends request: "GET" + headers
  * server starts response "200 OK" + headers + first chunk of body
  * server sends most of body (1460 bytes per TCP segment), with a steady
    stream of TCP ACK segments from the client
  * server "freezes" for 15 sec and then sends a TCP FIN ACK segment --
    i.e. the connection is closed by the server
  * client gets end-of-file on next read and reports failure
    (bytes read != Content-Length)

The second failure appears to be an unavoidable consequence of the first
one: httplib.py (the standard Python HTTP client library) attempts to
read a response line from a closed socket and barfs:

  FAIL: HTTP error
  Traceback (most recent call last):
    File "./httptest.py", line 93, in run_connection
    File "./httptest.py", line 111, in send_request
    File "/usr/lib/python2.3/httplib.py", line 779, in getresponse
    File "/usr/lib/python2.3/httplib.py", line 273, in begin
    File "/usr/lib/python2.3/httplib.py", line 237, in _read_status
  BadStatusLine

Then the client falls out to the outer loop, catches the exception,
opens a new connection, and carries on quite happily.

From client workstations running Windows, I would say that rather more
than 1/100 requests fail.  (Or perhaps I should say 2/100, since
failures come in pairs *with my test client* -- other HTTP clients might
do a better job of detecting a closed connection and opening a new one
automatically.)
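
To illustrate what "detecting a closed connection and opening a new one" could look like, here is a rough Python 3 sketch. The helper name get_with_reconnect and its retries parameter are hypothetical, and I am not claiming any particular client works this way. It relies on http.client's HTTPConnection reopening its socket on the next request() after close(). Retrying like this is only reasonable for idempotent requests such as GET, and it would hide only the second failure of each pair, not the truncated response itself.

```python
import http.client


def get_with_reconnect(conn, uri, retries=1):
    """GET `uri` on `conn`; if the kept-alive connection is dead, reopen and retry.

    Returns (status, body). Re-raises the error once `retries` is exhausted.
    """
    for attempt in range(retries + 1):
        try:
            conn.request("GET", uri)
            resp = conn.getresponse()
            return resp.status, resp.read()
        except (http.client.BadStatusLine, ConnectionError):
            # The server closed the socket between requests. After close(),
            # HTTPConnection connects again on the next request().
            conn.close()
            if attempt == retries:
                raise
```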

From other server machines (also Linux boxes), failures seem to be more
on the order of 1/10,000 requests.  That's based on two runs of 10,000
requests each, so hardly scientific.  This could be a question of
network hardware, network topology, device drivers, OS TCP stack,
... who knows.  I'm pretty sure it's *not* the HTTP client library,
though, since we have observed failures in Java programs, in C++
programs (using wininet as the client library), and in my Python test
program.

One more data point: the failure does not appear to happen with other
HTTP servers.  I wrote a trivial single-threaded HTTP 1.0 server in
Python, and we have not seen failures with it.  And many months ago we
experimented with connecting to Tomcat directly instead of going through
Apache, and the failures disappeared.  (There are various good reasons
to keep the dual Apache/Tomcat setup: SSL, CGI, mod_rewrite, ...)
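
In case it helps anyone set up the same comparison, a trivial single-threaded HTTP 1.0 server can be sketched in a few lines with Python 3's standard http.server module. This is not the server I actually ran: the names FixedBodyHandler, BODY, and make_server are invented here, and the body is arbitrary filler sized to match the 131072-byte file from the failure report.

```python
import http.server

# Filler content; only the length (131072 bytes) matters for the comparison.
BODY = b"x" * 131072


class FixedBodyHandler(http.server.BaseHTTPRequestHandler):
    # protocol_version is left at its default, "HTTP/1.0", so the server
    # closes the connection after each response.

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

    def log_message(self, format, *args):
        pass  # suppress per-request logging to stderr


def make_server(host="127.0.0.1", port=0):
    # HTTPServer handles one connection at a time: no threads, no forking.
    # port=0 asks the OS for any free port.
    return http.server.HTTPServer((host, port), FixedBodyHandler)
```

Calling make_server(host="", port=8000).serve_forever() would serve the fixed body to any GET on port 8000 until interrupted.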

Oh yeah, one more thing: this problem only started appearing when we
upgraded to Apache 2.0 (in order to use Tomcat).  Until about 18 months
ago, this server was running Apache 1.3 with JServ, and we never had a
problem.  (Apart from JServ being a pain in the neck. ;-)

So... has anyone else witnessed weird problems with Apache 2.0 over
gigabit networks?  My gut instinct says it's not *just* Apache, and not
*just* the hardware, and not *just* Linux, and not *just* Windows, but
some combination of those or various other factors.  Maybe Apache is
tickling the hardware (or the kernel) in a way that exposes bugs?

Any ideas are welcome!

        Greg
