On Tue, Mar 6, 2012 at 1:44 PM, Tom Evans <tevans.uk@xxxxxxxxxxxxxx> wrote:
> On Tue, Mar 6, 2012 at 1:01 PM, Tom Evans <tevans.uk@xxxxxxxxxxxxxx> wrote:
>> So, we've been trying to track disappearing requests. We see lots of
>> requests that go via the CDN to reach our data centre failing with
>> error code 503. This error message is produced by the CDN, and the
>> request is not logged in either of the FEPs.
>>
>> We've been trying to track what happens with tcpdump running at SQUID
>> and at FW. At SQUID, we see a POST request for a resource, followed by
>> a long wait, and then a 503 generated by the CDN. Interestingly, 95%
>> of the failing requests are POST requests.
>>
>> Tracking that at FW, we see the request coming in, and no reply from
>> the FEP. The connection is a keep-alive connection, and had just
>> completed a similar request 4 seconds previously, to which we returned
>> a 200 and data. This (failing) request is made on the same connection;
>> we reply with an ACK, then no data for 47 seconds (the same wait as
>> seen by squid), and finally the connection is closed with a FIN.
>>
>
> Sorry, one final thing - we can see these hanging connections on the FEP:
>
> netstat -an | head -n 2 ; netstat -an | fgrep EST | fgrep -v "tcp4 0"
>
> This shows the established sockets with an unread recv-q. Obviously not
> every socket shown is hanging; but by observing it over an extended
> (10s) period, you can quickly see connections whose recv-q is not
> drained.
>

A final follow-up for today. We have dramatically* improved the error
rates by tuning the event MPM so that child processes are not constantly
being reaped and re-spawned. In brief, we massively increased
MaxSpareThreads, so that httpd won't start reaping children until more
than 75% of the potential workers (MaxClients) are idle. We're now
running:

StartServers 8
MaxClients 1024
MinSpareThreads 128
MaxSpareThreads 768
ThreadsPerChild 64

We are now not seeing apache children being reaped or re-spawned
(good!), and we're also not seeing any hanging established connections
with an unread recv-q, nor any failures from our squid proxy (good!).

I don't think we've actually solved anything, though; I think we have
just engineered a sweet spot where the problems do not occur (not good!).

Our tentative hypothesis for what is happening is this. Apache notices
that there are too many idle workers and decides to shut down one of the
child processes. It marks that process as shutting down, and no new
requests are allocated to workers from that process. Meanwhile, a
keep-alive socket which is allocated to that child process comes alive
again, and a new request is pushed down it. Apache never reads the
request, as the child is marked as shutting down. Once the child does
finish all its outstanding requests, it does indeed shut down, and the
OS sends a FIN packet to close the unread socket.

Does this sound remotely possible? I would really appreciate some
advice/insight here. When I get a chance, I will try to engineer a
config that puts httpd into this sort of state, and a test case that
should expose it.

Cheers

Tom

* So much so that 20 minutes after making the changes, my boss suggested
we all retire to the pub and celebrate.
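
P.S. A rough sketch of the kind of test client I have in mind - untested,
and the HOST/PORT, URL and timings below are placeholders; it assumes a
test httpd deliberately tuned with a low MaxSpareThreads so that the
child holding the keep-alive connection gets marked for shutdown while
the client idles:

#!/usr/bin/env python
# Hold a keep-alive connection open, idle long enough for the event MPM
# to start reaping children, then send a second request on the same
# connection and see whether it is ever answered.
import socket
import time

HOST, PORT = "127.0.0.1", 8080   # placeholder: test httpd instance
IDLE = 60                        # placeholder: long enough for reaping to kick in

REQ = ("POST /test HTTP/1.1\r\n"
       "Host: " + HOST + "\r\n"
       "Connection: keep-alive\r\n"
       "Content-Type: text/plain\r\n"
       "Content-Length: 4\r\n"
       "\r\n"
       "ping")

s = socket.create_connection((HOST, PORT))
s.sendall(REQ.encode())
print("first response:", s.recv(4096)[:64])   # expect a 200

time.sleep(IDLE)   # give httpd time to mark this child for shutdown

s.sendall(REQ.encode())   # second request on the same connection
s.settimeout(60)
try:
    data = s.recv(4096)
    print("second response:", data[:64] if data else "closed with FIN, no data")
except socket.timeout:
    print("no response within 60s - request apparently never read")
s.close()

If the hypothesis is right, the first request should get a normal reply,
and the second should either time out or come back as a bare FIN with no
data, matching what we see from squid.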