[users@httpd] After a couple days httpd child processes go 100% CPU

Damon Maria <damon@xxxxxxxxxxxxxx> · Fri, 22 Apr 2005 00:31:20 +1200

Sorry for the not very descriptive subject, but this is a hard to 

describe problem...

Some basic details:
- httpd-2.0.54 (although I've had this problem since at least 2.0.49)
- Redhat 9.0 (kernel 2.4.20-8)
- 2.4GHz P4 256M mem.

After some time (usually a couple of days) of running fine and averaging 

about 0.5% CPU usage httpd suddenly goes mad. A particular child process 

will be at close to 100% CPU usage. It will stay like that for a couple 

of seconds before returning to normal CPU usage as another child process 

goes to 100% CPU for a couple seconds. And so on and so on.

A server restart fixes the problem. Once it took 2 restarts. Even a 

graceful restart will do it but results in issues with a global mutex 

lock related to the SSLSessionCache which hasn't been released.

I added in a cronjob to restart the server every night at 5:00am. That 

seems to hold the problem at bay for a while but eventually it happens 

within the 24 hours between restarts (this time it was about 11 hours 

after the cronjob restart). So the length of time it takes to show up 

the problem isn't constant.

This issue has been around for quite a few months now. At first it only 

happened occasionally but now it's now getting more and more frequent - 

I guess tracking with the increase of traffic on our site. I tried 

reverting back to 2.0.49 but that made no difference. I've since 

upgraded with each release but still have the problem. This site has 

been running on apache 2 for a couple of years now.

I've written a script that continually dumps out the result of the 

server-status handler with ExtendedStatus On and runs top in batch mode 

piping it out to a log. Last time the issue happened I ran this script 

so I'd have a record of what CPU usage there was per process 

second-by-second as well as what those processes were doing according to 

httpd.

The output of top shows only one httpd child process is ever running 

100% CPU, and it changes to a different child process every couple of 

seconds. The line from server-status for that process always looks the 

same, something like:

Srv PID Acc M CPU SS Req Conn Child Slot Client VHost Request
68-1 22174 0/1/1 R 0.00 73 578 0.0 0.01 0.01 ? ? ..reading..

As you can see, it has used no CPU and has been sitting there for 73 

seconds handling this request (SS). If I trace this process (68-1 22174) 

through page after page of output from server-status the SS value 

(Seconds since beginning of most recent request) increases but none of 

the other stats change. Strange: the process was listed as close to 100% 

CPU for a couple of seconds by top but this doesn't show up in any of 

the stats from server-status.

As the client, vhost and request are never shown for these runaway 

processes I can't even figure out in which area of our configuration 

these requests are hitting.

By the time I was notified to the problem and got the script running 

there were already ~120 children looking like this and only 7 children 

listed with any request information. From the header of server-status:

RRRRRRRRRRRRRRRWRRRRRRRRRRCRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRKRRRRRRRRRRRRRRRRRRRWWR___RR
RR_____.........................................................
................................................................

Also from the log of top I can see that overall the machine is 0% idle, 

and 0% io wait, but about 85% system and only 15% user. Although I'm not 

sure what (if anything) that means. The memory is close to full but only 

8M of swap is being used, so I don't think that's a problem.

I'd like to run through an aspect of our setup that is not so ordinary, 

as this may give someone a clue as to what's going on. We run SSL on two 

ports: 443 and 444. This is to get around this bug:

http://issues.apache.org/bugzilla/show_bug.cgi?id=12355

Some URLs on 443 require a client certificate, and these always use GET 

rather than POST (because of the above bug). We also proxy another site 

(that is out of our control) which we must secure with client 

certificates. As this site uses POST we have to put this onto a virtual 

host on a different port so we can specify SSLVerifyClient require for 

that entire <virtualhost>, rather than for particular <location>s.

Our (truncated) SSL setup:

SSLSessionCache dbm:/var/log/httpd/ssl_cache
SSLSessionCacheTimeout 600
SSLVerifyDepth 1
SSLCACertificateFile XXXXXXX
SSLVerifyClient none

SSLCipherSuite 

ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP:+eNULL

SSLCertificateFile XXXXX
SSLCertificateKeyFile XXXXX
SSLCertificateChainFile XXXXX
<VirtualHost *:443>
  ServerName www.motorweb.co.nz
  SSLEngine on
  <LocationMatch "^/YYYY/ZZZZ">
    SSLVerifyClient require
    SSLOptions +OptRenegotiate +StdEnvVars
  </LocationMatch>
  ... several other <location>s like the one above
</VirtualHost>
<VirtualHost *:444>
  ServerName www.motorweb.co.nz
  SSLEngine on
  SSLVerifyClient require
  SSLOptions +OptRenegotiate +StdEnvVars
</VirtualHost>

Side note: we used to have only a single IP address which is why we used 

another port rather than a different IP for the proxied site. We now 

have multiple IPs and so will be moving this proxying onto a different 

IP on port 443 at some stage.

To make our configuration even more complex we also have to 

screen-scrape the pages being returned by the proxied site. To do this I 

wrote a perl module that uses PerlOutputFilterHandler (although, to be 

honest, I know close to nothing about perl).

Our perl setup (using mod_perl-2.0.0-RC4):

LoadModule perl_module        modules/mod_perl.so
PerlSwitches -I/opt/motorweb/modperl
PerlModule Apache2
PerlModule Motorweb::WOFFilter
<VirtualHost *:444>
  ...
  <Location /abc/>
    ProxyPass http://XXXXX/
    ProxyPassReverse http://XXXXX/
    PerlOutputFilterHandler Motorweb::WOFFilter
  </Location>
</VirtualHost>

This perl module had been running fine for about 6 months before this 

issue turned up, and hasn't been changed since. So I don't think that 

it's the problem.

I have suspected it may be something to do with SSL as once when this 

happened I could access a page fine through http:// but the same page 

through https:// wouldn't return anything (browser would timeout). 

However, on other occasions even http:// doesn't work (although this may 

be because there just isn't the CPU time available).

Our SSLSessionCache files also get really big (ssl_cache.pag is 

currently around 500MB), and while this issue was happening one time I 

checked and the ssl_cache.dir was suspiciously exactly 65536 bytes. But 

it gets recreated on a restart and always comes back to the same size, 

so I guess they're supposed to be that big? Is there a limit on the size 

of the cache or their files? I've also noticed the occasional line like 

this in the error_log (about once a day, sometimes more, sometimes less):

[Wed Apr 20 16:00:18 2005] [error] (20014)Error string not specified 

yet: Cannot store SSL session to DBM file `/var/log/httpd/ssl_cache'

Also remember my note up the top about when I tried a graceful restart. 

I guess the runaway child processes were still hanging around and 

holding onto a global mutex lock to the SSLSessionCache because I got 

the following lines in the error log:

Mon Apr 04 09:22:54 2005] [notice] Graceful restart requested, doing restart

[Mon Apr 04 09:22:57 2005] [warn] (43)Identifier removed: Failed to 

acquire global mutex lock

[Mon Apr 04 09:22:57 2005] [error] (13)Permission denied: Cannot open 

SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for scanning

[Mon Apr 04 09:22:57 2005] [warn] (22)Invalid argument: Failed to 

release global mutex lock

And then the same last three lines repeated over and over and over. 

Although usually "for scanning" was "for reading", but became almost 

entirely "for writing" near the end, and sometimes the acquire error was 

prefixed by "(22)Invalid argument" rather than "(43)Identifier removed". 

This went on for 6 seconds before clearing itself up. The last bit gets 

interesting and I've added those log lines to the end of this post 

because it also relates to some other stuff I want to tell you about first.

Does this give me a hint that it's something to do with the SSL cache? 

Presumably one of the child processes that had gone wild was holding 

onto this mutex to produce this problem after the restart indicating 

that that is the point it had 'locked up'?

See also my notes below about other SSL problems I've had.

Speaking of the error log, here are the other errors (besides "File does 

not exist") that regularly appear in the logs:

[Thu Apr 21 17:08:24 2005] [error] Re-negotiation handshake failed: Not 

accepted by client!?

I get thousands of these (more than 50 an hour at peek times). SSL does 

work fine and so do our client certificates, our users are the kind that 

tell us if it's not working. I've tried to figure out what this means 

but all other posters on the net with this error get additional error 

messages with it. All I get is that one error over and over again.

Really the only other error that fills the error_log is something along 

the lines of this:

[Wed Jan 05 08:39:44 2005] [error] [client 219.89.127.102] 

Apache::Filter::print: (103) Software caused connection abort at 

/opt/motorweb/modperl/Motorweb/WOFFilter.pm line 159, referer: 

https://www.motorweb.co.nz:444/xxx

Apache::Filter: (103) Software caused connection abort at -e line 0[Wed 

Jan 05 08:42:10 2005] [error] Re-negotiation handshake failed: Not 

accepted by client!?

That's not my email formatting either. "Apache::Filter: (103)..." is on 

a new line but "[Wed Jan 05 08:42:10 2005] [error]..." is *not*. I guess 

something isn't adding newlines into the log properly. I've always 

assumed that this is simply the user refreshing/stopping their browser 

half way through a page being filtered. But I'm not sure. This "Software 

caused connection abort" in my perl filter happens a lot when this 100% 

CPU issue is happening. But that I presume is just because users are 

getting sick of not seeing anything and hitting refresh.

One more thing about my setup. I use the java servlet container Resin 

which has an Apache module mod_caucho. This is a small and simple little 

module that passes requests off to a java based server on another 

machine. I don't think this is the problem because if the request has 

been parsed to the point where httpd can decide to use the handler 

caucho-request then I'd except to be seeing the request URL in the 

server-status output.

Another issue relating to SSL and 443/444.

I did previously strike what I thought may be a bug in httpd with 

running the two different SSL servers inside the one httpd instance. In 

some versions of IE, if a user goes to our https 443 site, but *does 

not* access any locations that use SSLVerifyClient, and then goes to the 

444 site (which is SSLVerifyClient require) then this may cause the 

httpd process they connected to on 444 to go 100% CPU and never come 

back. The process would be like that until I killed it. If I turned on 

more verbose logging I'd see this:

[Mon Mar 29 18:42:00 2004] [info] Connection to child 7 established 

(server www.motorweb.co.nz:443, client 172.20.20.106)

[Mon Mar 29 18:42:00 2004] [info] Seeding PRNG with 0 bytes of entropy
[Mon Mar 29 18:42:00 2004] [info] No acceptable peer certificate available

[Mon Mar 29 18:42:00 2004] [info] Connection to child 7 closed with 

abortive shutdown(server www.motorweb.co.nz:443, client 172.20.20.106)

If the user has been to a SSLVerifyClient require location on 443 first 

before going to 444 then the "No acceptable peer certificate" doesn't 

happen and the httpd process doesn't go mad.

This only happened with specific copies of IE in our office. Even the 

exact same version number may cause it on one computer, but not on 

another. Very weird indeed.

The problem I described up the top is different to this. In the way that 

the CPU usage seems to jump from process to process, where as with this 

443/444 issue it just stuck on one process. But it gets me to wondering. 

Would it be worth me separating out the configuration of the server on 

444 (as it shares very little configuration with the others) and run it 

in a separate httpd instance? I presume running a separate httpd 

instance is no problem?

If you've made it this far, thank you.

So, is it an SSL cache problem, another kind of SSL issue, something 

wrong with my perl script, a lack of memory, a DoS attack??? I just 

can't figure it out. As you can see, I've done a lot of investigation on 

this and am stumped. Any ideas, even suggestions on places I should look 

would help.

thanks,
Damon.

As I promised above, here's the tail bit of the log when I had all the 

mutex errors after a the graceful restart:

[Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to 

acquire global mutex lock

[Mon Apr 04 09:22:58 2005] [error] (13)Permission denied: Cannot open 

SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for writing (store)

[Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to 

release global mutex lock

[Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to 

acquire global mutex lock

[Mon Apr 04 09:22:58 2005] [error] (13)Permission denied: Cannot open 

SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for writing (store)

[Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to 

release global mutex lock

[Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to 

acquire global mutex lock

[Mon Apr 04 09:22:58 2005] [error] (13)Permission denied: Cannot open 

SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for writing (store)

[Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to 

release global mutex lock

[Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to 

acquire global mutex lock

[Mon Apr 04 09:22:58 2005] [error] (13)Permission denied: Cannot open 

SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for writing (store)

[Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to 

release global mutex lock

[Mon Apr 04 09:22:59 2005] [error] Cannot re-open SSLSessionCache DBM 

file `/var/log/httpd/ssl_cache' for expiring

[Mon Apr 04 09:22:59 2005] [warn] (22)Invalid argument: Failed to 

release global mutex lock

[Mon Apr 04 09:22:59 2005] [warn] (22)Invalid argument: Failed to 

acquire global mutex lock

[Mon Apr 04 09:22:59 2005] [error] (13)Permission denied: Cannot open 

SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for reading (fetch)

[Mon Apr 04 09:22:59 2005] [warn] (22)Invalid argument: Failed to 

release global mutex lock

Apache::Filter: (103) Software caused connection abort at -e line 0[Mon 

Apr 04 09:23:00 2005] [notice] Apache/2.0.53 configured -- resuming 

normal operations

[Mon Apr 04 09:23:01 2005] [warn] (43)Identifier removed: Failed to 

acquire global mutex lock

[Mon Apr 04 09:23:01 2005] [warn] (43)Identifier removed: Failed to 

release global mutex lock

[Mon Apr 04 09:23:01 2005] [error] [client 222.152.162.185] 

Apache::Filter::print: (103) Software caused connection abort at 

/opt/motorweb/modperl/Motorweb/WOFFilter.pm line 159, referer: 

https://www.motorweb.co.nz:444/wofonline/wfpro/default.asp?category=wfpro&service=record_express

Apache::Filter: (103) Software caused connection abort at -e line 0[Mon 

Apr 04 09:23:03 2005] [warn] (43)Identifier removed: Failed to acquire 

global mutex lock

[Mon Apr 04 09:23:03 2005] [warn] (43)Identifier removed: Failed to 

release global mutex lock

[Mon Apr 04 09:23:03 2005] [warn] (43)Identifier removed: Failed to 

acquire global mutex lock

[Mon Apr 04 09:23:03 2005] [warn] (43)Identifier removed: Failed to 

release global mutex lock

And then things carried on normally.

---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.
To unsubscribe, e-mail: users-unsubscribe@xxxxxxxxxxxxxxxx
  "   from the digest: users-digest-unsubscribe@xxxxxxxxxxxxxxxx
For additional commands, e-mail: users-help@xxxxxxxxxxxxxxxx