Apache "marking down" a back-end server

Suvendu Sekhar Mondal <suv3ndu@xxxxxxxxx> · Wed, 1 Nov 2017 21:09:18 +0530

Hello Everyone,

I am seeing an interesting behavior in Apache httpd.

We have multiple Apache httpd instances in front of a set of Tomcat
JVMs. I found that sometimes *one of the httpds marks one of the JVMs
down* for 180 seconds (the "retry" value). As a result, users logged on
to that JVM get 5xx errors. At first I suspected that long GCs were
causing it, but that was not the case: we have a 5-second "ping"
timeout, and GC pauses during the problem period were 500-700 ms. There
were also plenty of threads available in the JVM to serve new requests.
After some more drill-down, I found that each of those "mark down"
incidents correlates with some really long processing (800 seconds) on
the JVM, which exceeds our "ProxyTimeout" and "ttl" limits. Yes, some
of our app's workflows can take that long when they process a large
volume; we are working on it.
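
To make the comparison above concrete, here is a minimal sketch in
Python that checks each observed duration against the configured
limits. The numbers come from this message; the names (LIMITS,
OBSERVED, exceeded) are hypothetical, and the check is plain arithmetic
on my side, not a description of how httpd itself evaluates these
settings.

```python
# Configured limits in seconds, taken from the settings in this message.
LIMITS = {"ping": 5, "ProxyTimeout": 300, "ttl": 300}

# Worst observed durations in seconds during the problem period.
OBSERVED = {"gc_pause": 0.7, "long_request": 800}


def exceeded(duration, limits=LIMITS):
    """Return the names of the limits that a duration is longer than."""
    return [name for name, limit in limits.items() if duration > limit]


for label, seconds in OBSERVED.items():
    # An empty list means the duration stayed under every limit.
    print(label, seconds, exceeded(seconds))
```

The GC pause stays under every limit, while the 800-second request is
longer than all of them, which is why I stopped suspecting GC.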

My understanding is that this is not the "ping" failure case, in which
httpd marks the JVM down. That said, could a "ProxyTimeout" or "ttl"
failure cause httpd to mark the JVM down? Or do you think it is
something else? Please let me know.

httpd version: 2.4.10

httpd settings:
ProxyTimeout 300

<Proxy balancer://mycluster>
ProxySet lbmethod=byrequests
ProxySet stickysession=JSESSIONID|jsessionid
ProxySet scolonpathdelim=On
ProxySet growth=2
ProxySet nofailover=On

BalancerMember http://abc route=abc keepalive=on ttl=300 ping=5 retry=180

</Proxy>

Excerpts from httpd Error log:
[Wed Nov 01 08:17:39.221276 2017] [proxy_http:error] [pid 31848:tid
9828] (OS 10060)A connection attempt failed because the connected
party did not properly respond after a period of time, or established
connection failed because connected host has failed to respond.  :
[client 10.254.52.48:13964] AH01102: error reading status line from
remote server abc, referer: xxx
[Wed Nov 01 08:17:39.221276 2017] [proxy:error] [pid 31848:tid 9828]
[client 10.254.52.48:13964] AH00898: Timeout on 100-Continue returned
by /xxx
[Wed Nov 01 08:17:39.221276 2017] [proxy_balancer:error] [pid
31848:tid 9828] [client 10.254.52.48:13964] AH01167:
balancer://mycluster: All workers are in error state for route (abc),
referer: xxx
[Wed Nov 01 08:17:39.346281 2017] [proxy_balancer:error] [pid
31848:tid 9760] [client 10.254.52.48:17783] AH01167:
balancer://mycluster: All workers are in error state for route (abc)
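
For anyone who wants to line up these "mark down" records against
application timings, the following is a rough sketch in Python of how
the excerpt could be parsed. It assumes the httpd 2.4 error-log layout
shown above ("[timestamp] [module:level] ..."), re-joins lines that the
mail client wrapped, and uses hypothetical names (parse_events,
mark_down_times). Treating AH01167 as the marker of interest is my
assumption, based only on the records in this excerpt.

```python
import re
from datetime import datetime

# Start of an error-log record: "[timestamp] [module:level]".
RECORD_START = re.compile(
    r"^\[(\w{3} \w{3} \d{2} [\d:.]+ \d{4})\] \[(\w+):(\w+)\]"
)
# httpd message identifiers such as AH01167.
AH_CODE = re.compile(r"\bAH\d{5}\b")


def parse_events(text):
    """Return (timestamp, module, AH code or None) for each log record."""
    records = []
    for line in text.splitlines():
        if RECORD_START.match(line):
            records.append(line.strip())
        elif records and line.strip():
            # Continuation of a record that was wrapped in the mail.
            records[-1] += " " + line.strip()
    events = []
    for record in records:
        start = RECORD_START.match(record)
        code = AH_CODE.search(record)
        when = datetime.strptime(start.group(1), "%a %b %d %H:%M:%S.%f %Y")
        events.append((when, start.group(2), code.group(0) if code else None))
    return events


def mark_down_times(events):
    """Timestamps of AH01167 records ("All workers are in error state")."""
    return [when for when, _, code in events if code == "AH01167"]
```

Calling mark_down_times(parse_events(log_text)) on the full error log
would give a list of timestamps that can be compared with the start
times of the long-running requests on the JVM.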
