Hello everyone,
We upgraded from Apache 2.4.12 to 2.4.18 on a public-facing web server that proxies requests to backend servers. When we first cut over to the web server running the newer version (2.4.18), all traffic seemed to flow normally. A few days later, however, one of our customers reported random outages. Each outage showed up in the browser as a "This site can't be reached" page with "ERR_CONNECTION_TIMED_OUT". As far as we are aware, this is the only customer experiencing the issue, or at least the only one to report it. After looking through all available logs, for Apache and otherwise, we could not identify what was causing this or where it was occurring, so we set up packet captures (tcpdump) at both ends of the path between us and this customer. What we observed was the following:
Packet captures on the border firewall showed the SSL handshake failing during ECDH negotiation, after the ServerHello message was received by the client. The return packet was a "bad_record_mac" alert message, alert code 20.
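
For reference, the alert code can be read straight from the raw record bytes. The following is a minimal Python sketch, assuming the alert record appears unencrypted in the capture. The function name decode_tls_alert, the abbreviated lookup tables, and the sample bytes are illustrative only and are not taken from our capture.

```python
import struct

# Abbreviated tables from the TLS alert protocol; only a few codes are listed.
ALERT_LEVELS = {1: "warning", 2: "fatal"}
ALERT_DESCRIPTIONS = {
    0: "close_notify",
    10: "unexpected_message",
    20: "bad_record_mac",
    40: "handshake_failure",
    50: "decode_error",
}

TLS_CONTENT_TYPE_ALERT = 21  # record-layer content type for alerts


def decode_tls_alert(record: bytes) -> tuple[str, str]:
    """Decode a plaintext TLS alert record into (level, description)."""
    if len(record) < 7:
        raise ValueError("record too short to hold an alert")
    # Record header: content type (1), version (2), payload length (2).
    content_type, _major, _minor, length = struct.unpack("!BBBH", record[:5])
    if content_type != TLS_CONTENT_TYPE_ALERT:
        raise ValueError("not a TLS alert record")
    if length != 2:
        # A plaintext alert payload is exactly 2 bytes; anything else is
        # either encrypted or malformed and cannot be decoded here.
        raise ValueError("alert payload is encrypted or malformed")
    level, description = record[5], record[6]
    return (
        ALERT_LEVELS.get(level, f"unknown({level})"),
        ALERT_DESCRIPTIONS.get(description, f"unknown({description})"),
    )


if __name__ == "__main__":
    # Hypothetical sample: TLS 1.2 record carrying a fatal alert with code 20.
    sample = bytes.fromhex("15030300020214")
    print(decode_tls_alert(sample))
```

A description byte of 0x14 (decimal 20) corresponds to bad_record_mac, which matches what we saw in the capture.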
Because of this, we decided to make the following changes:
During troubleshooting, the TIME_WAIT value was increased on the firewall to allow enough time for a response; this did not resolve the issue. The firewall was then configured for TCP bypass for the IP addresses having the communication issues; this did not resolve the issue either. The firewall is a Cisco ASA 5545 running v 9.8(3)29.
While comparing our Apache 2.4.12 and 2.4.18 setups, we found that we were running the "event" MPM on 2.4.18 versus the "worker" MPM on 2.4.12. After reading up on the differences between these two MPM types, we immediately thought this could have played a part because of how sockets are handled. We reverted the MPM to "worker" on the newer Apache version and tested again, but this customer still experienced the same random issues.
Additional information:
- The customer uses a single source IP address, from which all of their employees' requests to our application originate.
- There seems to be a correlation between this customer's peak traffic times and the likelihood of these events. As stated, all traffic comes from that single source IP address, and there could be 200+ users on our system at such a time.
- The customer reports fewer occurrences of this issue outside of their peak traffic times.
- We've tuned ListenBacklog to 99999 with no noticeable impact on this issue, although we believe it could have played a part in a separate issue outside this scope.
Any help would be greatly appreciated, as we are out of ideas and this customer has not been very cooperative in helping us troubleshoot this issue. We've had to revert to Apache 2.4.12, which we would like to upgrade from.
Thank you,
Franck