On Tue, Mar 6, 2012 at 1:44 PM, Tom Evans <tevans.uk@xxxxxxxxxxxxxx> wrote:
> On Tue, Mar 6, 2012 at 1:01 PM, Tom Evans <tevans.uk@xxxxxxxxxxxxxx> wrote:
>> So, we've been trying to track disappearing requests. We see lots of
>> requests that go via the CDN to reach our data centre failing with
>> error code 503. This error message is produced by the CDN, and the
>> request is not logged in either of the FEPs.
>>
>> We've been trying to track what happens with tcpdump running at SQUID
>> and at FW. At SQUID, we see a POST request for a resource, followed by
>> a long wait, and then a 503 generated by the CDN. Interestingly, 95%
>> of the failing requests are POST requests.
>>
>> Tracking that at FW, we see the request coming in, and no reply from
>> the FEP. The connection is a keep-alive connection, and had just
>> completed a similar request 4 seconds previously, to which we returned
>> a 200 and data. This (failing) request is made on the same connection;
>> we reply with an ACK, then no data for 47 seconds (the same wait as
>> seen by squid), and finally the connection is closed with a FIN.
>>
>
> Sorry, one final thing - we can see these hanging connections on the FEP:
>
> netstat -an | head -n 2 ; netstat -an | fgrep EST | fgrep -v "tcp4 0"
>
> This shows the established sockets with an unread recv-q. Obviously not
> every socket shown is hanging; but by observing it over an extended
> (10s) period, you can quickly see connections whose recv-q is not
> drained.
>

A final follow-up for today. We have dramatically* improved the error
rates by tuning the event MPM so that child processes are not constantly
being reaped and re-spawned. In brief, we massively increased
MaxSpareThreads, so that httpd won't start reaping children until more
than 75% of the potential workers (MaxClients) are idle. We're now
running:

StartServers 8
MaxClients 1024
MinSpareThreads 128
MaxSpareThreads 768
ThreadsPerChild 64

We are now not seeing apache children being reaped or re-spawned
(good!), and we're also not seeing any hanging established connections
with an unread recv-q, nor any failures from our squid proxy (good!).

I don't think we've actually solved anything, though; I think we have
just engineered a sweet spot where the problems do not occur (not good!).

Our tentative hypothesis for what is happening is this. Apache notices
that there are too many idle workers and decides to shut down one of the
child processes. It marks that process as shutting down, and no new
requests are allocated to workers from that process. Meanwhile, a
keep-alive socket which is allocated to that child process comes alive
again, and a new request is pushed down it. Apache never reads the
request, as the child is marked as shutting down. Once the child does
finish all its outstanding requests, it does indeed shut down, and the
OS sends a FIN packet to close the unread socket.

Does this sound remotely possible? I would really appreciate some
advice/insight here. When I get a chance, I will try to engineer a
config that puts httpd into this sort of state, and a test case that
should expose it.

Cheers

Tom

* So much so that 20 minutes after making the changes, my boss suggested
we all retire to the pub and celebrate.
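
P.S. A rough sketch of the kind of test client I have in mind - untested,
and the HOST/PORT, URL and timings below are placeholders; it assumes a
test httpd deliberately tuned with a low MaxSpareThreads so that the
child holding the keep-alive connection gets marked for shutdown while
the client idles:

#!/usr/bin/env python
# Hold a keep-alive connection open, idle long enough for the event MPM
# to start reaping children, then send a second request on the same
# connection and see whether it is ever answered.
import socket
import time

HOST, PORT = "127.0.0.1", 8080   # placeholder: test httpd instance
IDLE = 60                        # placeholder: long enough for reaping to kick in

REQ = ("POST /test HTTP/1.1\r\n"
       "Host: " + HOST + "\r\n"
       "Connection: keep-alive\r\n"
       "Content-Type: text/plain\r\n"
       "Content-Length: 4\r\n"
       "\r\n"
       "ping")

s = socket.create_connection((HOST, PORT))
s.sendall(REQ.encode())
print("first response:", s.recv(4096)[:64])   # expect a 200

time.sleep(IDLE)   # give httpd time to mark this child for shutdown

s.sendall(REQ.encode())   # second request on the same connection
s.settimeout(60)
try:
    data = s.recv(4096)
    print("second response:", data[:64] if data else "closed with FIN, no data")
except socket.timeout:
    print("no response within 60s - request apparently never read")
s.close()

If the hypothesis is right, the first request should get a normal reply,
and the second should either time out or come back as a bare FIN with no
data, matching what we see from squid.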