On 06.03.2012 18:26, Tom Evans wrote:
On Tue, Mar 6, 2012 at 1:44 PM, Tom Evans<tevans.uk@xxxxxxxxxxxxxx> wrote:On Tue, Mar 6, 2012 at 1:01 PM, Tom Evans<tevans.uk@xxxxxxxxxxxxxx> wrote:So, we've been trying to track disappearing requests. We see lots of requests that go via the CDN to reach our data centre failing with error code 503. This error message is produced by the CDN, and the request is not logged in either of the FEPs. We've been trying to track what happens with tcpdump running at SQUID and at FW. At SQUID, we see a POST request for a resource, followed by a long wait, and then a 503 generated by the CDN. Interestingly, 95% of the failing requests are POST requests. Tracking that at FW, we see the request coming in, and no reply from the FEP. The connection is a keep-alive connection, and had just completed a similar request 4 seconds previously, to which we returned a 200 and data. This (failing) request is made on the same connection, we reply with an ACK, then no data for 47 seconds (same wait as seen by squid), and finally the connection is closed with a FIN.Sorry, one final thing - we can see these hanging connections on the FEP: netstat -an | head -n 2 ; netstat -an | fgrep EST | fgrep -v "tcp4 0" This shows the established sockets with unread recv-q. Obviously not every socket shown is hanging; but by observing it over an extended (10s) period, you can quickly see connections whose recv-q is not drained.A final follow up for today. We have dramatically* improved the error rates by tuning the event MPM, so that child processes were not being constantly reaped and re-spawned. In brief, we massively increased MaxSpareThreads, so that it wouldn't start reaping until more than 75% of potential workers (MaxClients) are idle. We're now running: StartServers 8 MaxClients 1024 MinSpareThreads 128 MaxSpareThreads 768 ThreadsPerChild 64 We now are not seeing apache children getting reaped or re-spawned (good!) and we're also not seeing any hanging established connections with unread recv-q, nor any failures from our squid proxy (good!). I don't think we've solved anything though, I think we have just engineered a sweet spot where the problems do not occur (not good!). Our tentative hypothesis for what is happening is this. Apache notices that there are too many idle workers, and decides to shutdown one of the processes. It marks that process as shutting down, and no new requests are allocated to workers from that process. Meanwhile, a keep-alive socket which is allocated to that child process comes alive again, and a new request is pushed down it. Apache never bothers to read the request, as the child is marked as shutting down. Once the child does finish all outstanding requests, the child does indeed shut down, and the OS sends a FIN packet to shut down the unread socket. Does this sound remotely possible? I would really appreciate some advice/insight here.
Yes it reminds me of a similar observation using event on 2.2 about 1-2 years ago. When Apache needed to recycle a process due to either spare thread checking, max requests per child or just a graceful restart, the event MPM handled existing keep alive connections more unfriendly than other MPMs. The expectation though was that an HTTP client using Keep-Alive should be able to resend a failed request on another connection, because race conditions can not be avoided with HTTP keep alive (it is always possible, that the server closes the connection and in parallel the client started to send the next request). It's just that event seems to show that behaviour more often than other MPMs.
I did not yet check, whether the overhauled event MPM in 2.4 handles this better.
Originally I was able to reproduce this behaviour by: - setting MaxRequestsPerChild to triger frequent process recycling- Increase Keep Alive count and timeout to allow longer than usual keep alive usage
- Adding the process PID to the AccessLog LogFormat (%P) - Adding the keep alive count of a connection to the AccessLog LogFormat - (I think) adding the client connection port to the AccessLog LogFormat- adding some log statements to ab.c, which shows local port number when sending a request on an established connection fails
It could be seen, that connections were exactly failing when processes were stopped and that the connections were the one that did not completely exhaust there allowed keep alive count.
If you can easily reproduce the problem and have a test environment, it would be very interesting whether 2.4 behaves better.
Adjusting min and max spare (increasing the difference between the two) threads to reduce creation and destruction of processes is a good optimization nevertheless.
Regards, Rainer --------------------------------------------------------------------- To unsubscribe, e-mail: users-unsubscribe@xxxxxxxxxxxxxxxx For additional commands, e-mail: users-help@xxxxxxxxxxxxxxxx