[users@httpd] After a couple days httpd child processes go 100% CPU

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Sorry for the not very descriptive subject, but this is a hard to describe problem...

Some basic details:
- httpd-2.0.54 (although I've had this problem since at least 2.0.49)
- Redhat 9.0 (kernel 2.4.20-8)
- 2.4GHz P4 256M mem.

After some time (usually a couple of days) of running fine and averaging about 0.5% CPU usage httpd suddenly goes mad. A particular child process will be at close to 100% CPU usage. It will stay like that for a couple of seconds before returning to normal CPU usage as another child process goes to 100% CPU for a couple seconds. And so on and so on.

A server restart fixes the problem. Once it took 2 restarts. Even a graceful restart will do it but results in issues with a global mutex lock related to the SSLSessionCache which hasn't been released.

I added in a cronjob to restart the server every night at 5:00am. That seems to hold the problem at bay for a while but eventually it happens within the 24 hours between restarts (this time it was about 11 hours after the cronjob restart). So the length of time it takes to show up the problem isn't constant.

This issue has been around for quite a few months now. At first it only happened occasionally but now it's now getting more and more frequent - I guess tracking with the increase of traffic on our site. I tried reverting back to 2.0.49 but that made no difference. I've since upgraded with each release but still have the problem. This site has been running on apache 2 for a couple of years now.

I've written a script that continually dumps out the result of the server-status handler with ExtendedStatus On and runs top in batch mode piping it out to a log. Last time the issue happened I ran this script so I'd have a record of what CPU usage there was per process second-by-second as well as what those processes were doing according to httpd.

The output of top shows only one httpd child process is ever running 100% CPU, and it changes to a different child process every couple of seconds. The line from server-status for that process always looks the same, something like:

Srv PID Acc M CPU SS Req Conn Child Slot Client VHost Request
68-1 22174 0/1/1 R 0.00 73 578 0.0 0.01 0.01 ? ? ..reading..

As you can see, it has used no CPU and has been sitting there for 73 seconds handling this request (SS). If I trace this process (68-1 22174) through page after page of output from server-status the SS value (Seconds since beginning of most recent request) increases but none of the other stats change. Strange: the process was listed as close to 100% CPU for a couple of seconds by top but this doesn't show up in any of the stats from server-status.

As the client, vhost and request are never shown for these runaway processes I can't even figure out in which area of our configuration these requests are hitting.

By the time I was notified to the problem and got the script running there were already ~120 children looking like this and only 7 children listed with any request information. From the header of server-status:

RRRRRRRRRRRRRRRWRRRRRRRRRRCRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRKRRRRRRRRRRRRRRRRRRRWWR___RR
RR_____.........................................................
................................................................

Also from the log of top I can see that overall the machine is 0% idle, and 0% io wait, but about 85% system and only 15% user. Although I'm not sure what (if anything) that means. The memory is close to full but only 8M of swap is being used, so I don't think that's a problem.

I'd like to run through an aspect of our setup that is not so ordinary, as this may give someone a clue as to what's going on. We run SSL on two ports: 443 and 444. This is to get around this bug:

http://issues.apache.org/bugzilla/show_bug.cgi?id=12355

Some URLs on 443 require a client certificate, and these always use GET rather than POST (because of the above bug). We also proxy another site (that is out of our control) which we must secure with client certificates. As this site uses POST we have to put this onto a virtual host on a different port so we can specify SSLVerifyClient require for that entire <virtualhost>, rather than for particular <location>s.

Our (truncated) SSL setup:

SSLSessionCache dbm:/var/log/httpd/ssl_cache
SSLSessionCacheTimeout 600
SSLVerifyDepth 1
SSLCACertificateFile XXXXXXX
SSLVerifyClient none
SSLCipherSuite ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP:+eNULL
SSLCertificateFile XXXXX
SSLCertificateKeyFile XXXXX
SSLCertificateChainFile XXXXX
<VirtualHost *:443>
  ServerName www.motorweb.co.nz
  SSLEngine on
  <LocationMatch "^/YYYY/ZZZZ">
    SSLVerifyClient require
    SSLOptions +OptRenegotiate +StdEnvVars
  </LocationMatch>
  ... several other <location>s like the one above
</VirtualHost>
<VirtualHost *:444>
  ServerName www.motorweb.co.nz
  SSLEngine on
  SSLVerifyClient require
  SSLOptions +OptRenegotiate +StdEnvVars
</VirtualHost>

Side note: we used to have only a single IP address which is why we used another port rather than a different IP for the proxied site. We now have multiple IPs and so will be moving this proxying onto a different IP on port 443 at some stage.

To make our configuration even more complex we also have to screen-scrape the pages being returned by the proxied site. To do this I wrote a perl module that uses PerlOutputFilterHandler (although, to be honest, I know close to nothing about perl).

Our perl setup (using mod_perl-2.0.0-RC4):

LoadModule perl_module        modules/mod_perl.so
PerlSwitches -I/opt/motorweb/modperl
PerlModule Apache2
PerlModule Motorweb::WOFFilter
<VirtualHost *:444>
  ...
  <Location /abc/>
    ProxyPass http://XXXXX/
    ProxyPassReverse http://XXXXX/
    PerlOutputFilterHandler Motorweb::WOFFilter
  </Location>
</VirtualHost>

This perl module had been running fine for about 6 months before this issue turned up, and hasn't been changed since. So I don't think that it's the problem.

I have suspected it may be something to do with SSL as once when this happened I could access a page fine through http:// but the same page through https:// wouldn't return anything (browser would timeout). However, on other occasions even http:// doesn't work (although this may be because there just isn't the CPU time available).

Our SSLSessionCache files also get really big (ssl_cache.pag is currently around 500MB), and while this issue was happening one time I checked and the ssl_cache.dir was suspiciously exactly 65536 bytes. But it gets recreated on a restart and always comes back to the same size, so I guess they're supposed to be that big? Is there a limit on the size of the cache or their files? I've also noticed the occasional line like this in the error_log (about once a day, sometimes more, sometimes less):

[Wed Apr 20 16:00:18 2005] [error] (20014)Error string not specified yet: Cannot store SSL session to DBM file `/var/log/httpd/ssl_cache'

Also remember my note up the top about when I tried a graceful restart. I guess the runaway child processes were still hanging around and holding onto a global mutex lock to the SSLSessionCache because I got the following lines in the error log:

Mon Apr 04 09:22:54 2005] [notice] Graceful restart requested, doing restart
[Mon Apr 04 09:22:57 2005] [warn] (43)Identifier removed: Failed to acquire global mutex lock [Mon Apr 04 09:22:57 2005] [error] (13)Permission denied: Cannot open SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for scanning [Mon Apr 04 09:22:57 2005] [warn] (22)Invalid argument: Failed to release global mutex lock

And then the same last three lines repeated over and over and over. Although usually "for scanning" was "for reading", but became almost entirely "for writing" near the end, and sometimes the acquire error was prefixed by "(22)Invalid argument" rather than "(43)Identifier removed". This went on for 6 seconds before clearing itself up. The last bit gets interesting and I've added those log lines to the end of this post because it also relates to some other stuff I want to tell you about first.

Does this give me a hint that it's something to do with the SSL cache? Presumably one of the child processes that had gone wild was holding onto this mutex to produce this problem after the restart indicating that that is the point it had 'locked up'?

See also my notes below about other SSL problems I've had.

Speaking of the error log, here are the other errors (besides "File does not exist") that regularly appear in the logs:

[Thu Apr 21 17:08:24 2005] [error] Re-negotiation handshake failed: Not accepted by client!?

I get thousands of these (more than 50 an hour at peek times). SSL does work fine and so do our client certificates, our users are the kind that tell us if it's not working. I've tried to figure out what this means but all other posters on the net with this error get additional error messages with it. All I get is that one error over and over again.

Really the only other error that fills the error_log is something along the lines of this:

[Wed Jan 05 08:39:44 2005] [error] [client 219.89.127.102] Apache::Filter::print: (103) Software caused connection abort at /opt/motorweb/modperl/Motorweb/WOFFilter.pm line 159, referer: https://www.motorweb.co.nz:444/xxx Apache::Filter: (103) Software caused connection abort at -e line 0[Wed Jan 05 08:42:10 2005] [error] Re-negotiation handshake failed: Not accepted by client!?

That's not my email formatting either. "Apache::Filter: (103)..." is on a new line but "[Wed Jan 05 08:42:10 2005] [error]..." is *not*. I guess something isn't adding newlines into the log properly. I've always assumed that this is simply the user refreshing/stopping their browser half way through a page being filtered. But I'm not sure. This "Software caused connection abort" in my perl filter happens a lot when this 100% CPU issue is happening. But that I presume is just because users are getting sick of not seeing anything and hitting refresh.

One more thing about my setup. I use the java servlet container Resin which has an Apache module mod_caucho. This is a small and simple little module that passes requests off to a java based server on another machine. I don't think this is the problem because if the request has been parsed to the point where httpd can decide to use the handler caucho-request then I'd except to be seeing the request URL in the server-status output.

Another issue relating to SSL and 443/444.

I did previously strike what I thought may be a bug in httpd with running the two different SSL servers inside the one httpd instance. In some versions of IE, if a user goes to our https 443 site, but *does not* access any locations that use SSLVerifyClient, and then goes to the 444 site (which is SSLVerifyClient require) then this may cause the httpd process they connected to on 444 to go 100% CPU and never come back. The process would be like that until I killed it. If I turned on more verbose logging I'd see this:

[Mon Mar 29 18:42:00 2004] [info] Connection to child 7 established (server www.motorweb.co.nz:443, client 172.20.20.106)
[Mon Mar 29 18:42:00 2004] [info] Seeding PRNG with 0 bytes of entropy
[Mon Mar 29 18:42:00 2004] [info] No acceptable peer certificate available
[Mon Mar 29 18:42:00 2004] [info] Connection to child 7 closed with abortive shutdown(server www.motorweb.co.nz:443, client 172.20.20.106)

If the user has been to a SSLVerifyClient require location on 443 first before going to 444 then the "No acceptable peer certificate" doesn't happen and the httpd process doesn't go mad.

This only happened with specific copies of IE in our office. Even the exact same version number may cause it on one computer, but not on another. Very weird indeed.

The problem I described up the top is different to this. In the way that the CPU usage seems to jump from process to process, where as with this 443/444 issue it just stuck on one process. But it gets me to wondering. Would it be worth me separating out the configuration of the server on 444 (as it shares very little configuration with the others) and run it in a separate httpd instance? I presume running a separate httpd instance is no problem?

If you've made it this far, thank you.

So, is it an SSL cache problem, another kind of SSL issue, something wrong with my perl script, a lack of memory, a DoS attack??? I just can't figure it out. As you can see, I've done a lot of investigation on this and am stumped. Any ideas, even suggestions on places I should look would help.

thanks,
Damon.


As I promised above, here's the tail bit of the log when I had all the mutex errors after a the graceful restart:

[Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to acquire global mutex lock [Mon Apr 04 09:22:58 2005] [error] (13)Permission denied: Cannot open SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for writing (store) [Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to release global mutex lock [Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to acquire global mutex lock [Mon Apr 04 09:22:58 2005] [error] (13)Permission denied: Cannot open SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for writing (store) [Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to release global mutex lock [Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to acquire global mutex lock [Mon Apr 04 09:22:58 2005] [error] (13)Permission denied: Cannot open SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for writing (store) [Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to release global mutex lock [Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to acquire global mutex lock [Mon Apr 04 09:22:58 2005] [error] (13)Permission denied: Cannot open SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for writing (store) [Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to release global mutex lock [Mon Apr 04 09:22:59 2005] [error] Cannot re-open SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for expiring [Mon Apr 04 09:22:59 2005] [warn] (22)Invalid argument: Failed to release global mutex lock [Mon Apr 04 09:22:59 2005] [warn] (22)Invalid argument: Failed to acquire global mutex lock [Mon Apr 04 09:22:59 2005] [error] (13)Permission denied: Cannot open SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for reading (fetch) [Mon Apr 04 09:22:59 2005] [warn] (22)Invalid argument: Failed to release global mutex lock Apache::Filter: (103) Software caused connection abort at -e line 0[Mon Apr 04 09:23:00 2005] [notice] Apache/2.0.53 configured -- resuming normal operations [Mon Apr 04 09:23:01 2005] [warn] (43)Identifier removed: Failed to acquire global mutex lock [Mon Apr 04 09:23:01 2005] [warn] (43)Identifier removed: Failed to release global mutex lock [Mon Apr 04 09:23:01 2005] [error] [client 222.152.162.185] Apache::Filter::print: (103) Software caused connection abort at /opt/motorweb/modperl/Motorweb/WOFFilter.pm line 159, referer: https://www.motorweb.co.nz:444/wofonline/wfpro/default.asp?category=wfpro&service=record_express Apache::Filter: (103) Software caused connection abort at -e line 0[Mon Apr 04 09:23:03 2005] [warn] (43)Identifier removed: Failed to acquire global mutex lock [Mon Apr 04 09:23:03 2005] [warn] (43)Identifier removed: Failed to release global mutex lock [Mon Apr 04 09:23:03 2005] [warn] (43)Identifier removed: Failed to acquire global mutex lock [Mon Apr 04 09:23:03 2005] [warn] (43)Identifier removed: Failed to release global mutex lock

And then things carried on normally.

---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.
To unsubscribe, e-mail: users-unsubscribe@xxxxxxxxxxxxxxxx
  "   from the digest: users-digest-unsubscribe@xxxxxxxxxxxxxxxx
For additional commands, e-mail: users-help@xxxxxxxxxxxxxxxx



[Index of Archives]     [Open SSH Users]     [Linux ACPI]     [Linux Kernel]     [Linux Laptop]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Squid]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]

  Powered by Linux