Re: [users@httpd] After a couple days httpd child processes go 100% CPU

"herauthon" <herauthon@xxxxxxx> · Thu, 21 Apr 2005 23:26:57 +0200

i think about; "garbage collection"..
its the releasing of unused vars and
values, and recovering the memory space.

but why in this way?

----- Original Message -----
From: "Damon Maria" <damon@xxxxxxxxxxxxxx>
To: <users@xxxxxxxxxxxxxxxx>
Sent: Thursday, April 21, 2005 2:31 PM
Subject: [users@httpd] After a couple days httpd child processes go 100% CPU

> Sorry for the not very descriptive subject, but this is a hard to
> describe problem...
>
> Some basic details:
> - httpd-2.0.54 (although I've had this problem since at least 2.0.49)
> - Redhat 9.0 (kernel 2.4.20-8)
> - 2.4GHz P4 256M mem.
>
> After some time (usually a couple of days) of running fine and averaging
> about 0.5% CPU usage httpd suddenly goes mad. A particular child process
> will be at close to 100% CPU usage. It will stay like that for a couple
> of seconds before returning to normal CPU usage as another child process
> goes to 100% CPU for a couple seconds. And so on and so on.
>
> A server restart fixes the problem. Once it took 2 restarts. Even a
> graceful restart will do it but results in issues with a global mutex
> lock related to the SSLSessionCache which hasn't been released.
>
> I added in a cronjob to restart the server every night at 5:00am. That
> seems to hold the problem at bay for a while but eventually it happens
> within the 24 hours between restarts (this time it was about 11 hours
> after the cronjob restart). So the length of time it takes to show up
> the problem isn't constant.
>
> This issue has been around for quite a few months now. At first it only
> happened occasionally but now it's now getting more and more frequent -
> I guess tracking with the increase of traffic on our site. I tried
> reverting back to 2.0.49 but that made no difference. I've since
> upgraded with each release but still have the problem. This site has
> been running on apache 2 for a couple of years now.
>
> I've written a script that continually dumps out the result of the
> server-status handler with ExtendedStatus On and runs top in batch mode
> piping it out to a log. Last time the issue happened I ran this script
> so I'd have a record of what CPU usage there was per process
> second-by-second as well as what those processes were doing according to
> httpd.
>
> The output of top shows only one httpd child process is ever running
> 100% CPU, and it changes to a different child process every couple of
> seconds. The line from server-status for that process always looks the
> same, something like:
>
> Srv PID Acc M CPU SS Req Conn Child Slot Client VHost Request
> 68-1 22174 0/1/1 R 0.00 73 578 0.0 0.01 0.01 ? ? ..reading..
>
> As you can see, it has used no CPU and has been sitting there for 73
> seconds handling this request (SS). If I trace this process (68-1 22174)
> through page after page of output from server-status the SS value
> (Seconds since beginning of most recent request) increases but none of
> the other stats change. Strange: the process was listed as close to 100%
> CPU for a couple of seconds by top but this doesn't show up in any of
> the stats from server-status.
>
> As the client, vhost and request are never shown for these runaway
> processes I can't even figure out in which area of our configuration
> these requests are hitting.
>
> By the time I was notified to the problem and got the script running
> there were already ~120 children looking like this and only 7 children
> listed with any request information. From the header of server-status:
>
> RRRRRRRRRRRRRRRWRRRRRRRRRRCRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
> RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRKRRRRRRRRRRRRRRRRRRRWWR___RR
> RR_____.........................................................
> ................................................................
>
> Also from the log of top I can see that overall the machine is 0% idle,
> and 0% io wait, but about 85% system and only 15% user. Although I'm not
> sure what (if anything) that means. The memory is close to full but only
> 8M of swap is being used, so I don't think that's a problem.
>
> I'd like to run through an aspect of our setup that is not so ordinary,
> as this may give someone a clue as to what's going on. We run SSL on two
> ports: 443 and 444. This is to get around this bug:
>
> http://issues.apache.org/bugzilla/show_bug.cgi?id=12355
>
> Some URLs on 443 require a client certificate, and these always use GET
> rather than POST (because of the above bug). We also proxy another site
> (that is out of our control) which we must secure with client
> certificates. As this site uses POST we have to put this onto a virtual
> host on a different port so we can specify SSLVerifyClient require for
> that entire <virtualhost>, rather than for particular <location>s.
>
> Our (truncated) SSL setup:
>
> SSLSessionCache dbm:/var/log/httpd/ssl_cache
> SSLSessionCacheTimeout 600
> SSLVerifyDepth 1
> SSLCACertificateFile XXXXXXX
> SSLVerifyClient none
> SSLCipherSuite
> ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP:+eNULL
> SSLCertificateFile XXXXX
> SSLCertificateKeyFile XXXXX
> SSLCertificateChainFile XXXXX
> <VirtualHost *:443>
>    ServerName www.motorweb.co.nz
>    SSLEngine on
>    <LocationMatch "^/YYYY/ZZZZ">
>      SSLVerifyClient require
>      SSLOptions +OptRenegotiate +StdEnvVars
>    </LocationMatch>
>    ... several other <location>s like the one above
> </VirtualHost>
> <VirtualHost *:444>
>    ServerName www.motorweb.co.nz
>    SSLEngine on
>    SSLVerifyClient require
>    SSLOptions +OptRenegotiate +StdEnvVars
> </VirtualHost>
>
> Side note: we used to have only a single IP address which is why we used
> another port rather than a different IP for the proxied site. We now
> have multiple IPs and so will be moving this proxying onto a different
> IP on port 443 at some stage.
>
> To make our configuration even more complex we also have to
> screen-scrape the pages being returned by the proxied site. To do this I
> wrote a perl module that uses PerlOutputFilterHandler (although, to be
> honest, I know close to nothing about perl).
>
> Our perl setup (using mod_perl-2.0.0-RC4):
>
> LoadModule perl_module        modules/mod_perl.so
> PerlSwitches -I/opt/motorweb/modperl
> PerlModule Apache2
> PerlModule Motorweb::WOFFilter
> <VirtualHost *:444>
>    ...
>    <Location /abc/>
>      ProxyPass http://XXXXX/
>      ProxyPassReverse http://XXXXX/
>      PerlOutputFilterHandler Motorweb::WOFFilter
>    </Location>
> </VirtualHost>
>
> This perl module had been running fine for about 6 months before this
> issue turned up, and hasn't been changed since. So I don't think that
> it's the problem.
>
> I have suspected it may be something to do with SSL as once when this
> happened I could access a page fine through http:// but the same page
> through https:// wouldn't return anything (browser would timeout).
> However, on other occasions even http:// doesn't work (although this may
> be because there just isn't the CPU time available).
>
> Our SSLSessionCache files also get really big (ssl_cache.pag is
> currently around 500MB), and while this issue was happening one time I
> checked and the ssl_cache.dir was suspiciously exactly 65536 bytes. But
> it gets recreated on a restart and always comes back to the same size,
> so I guess they're supposed to be that big? Is there a limit on the size
> of the cache or their files? I've also noticed the occasional line like
> this in the error_log (about once a day, sometimes more, sometimes less):
>
> [Wed Apr 20 16:00:18 2005] [error] (20014)Error string not specified
> yet: Cannot store SSL session to DBM file `/var/log/httpd/ssl_cache'
>
> Also remember my note up the top about when I tried a graceful restart.
> I guess the runaway child processes were still hanging around and
> holding onto a global mutex lock to the SSLSessionCache because I got
> the following lines in the error log:
>
> Mon Apr 04 09:22:54 2005] [notice] Graceful restart requested, doing
restart
> [Mon Apr 04 09:22:57 2005] [warn] (43)Identifier removed: Failed to
> acquire global mutex lock
> [Mon Apr 04 09:22:57 2005] [error] (13)Permission denied: Cannot open
> SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for scanning
> [Mon Apr 04 09:22:57 2005] [warn] (22)Invalid argument: Failed to
> release global mutex lock
>
> And then the same last three lines repeated over and over and over.
> Although usually "for scanning" was "for reading", but became almost
> entirely "for writing" near the end, and sometimes the acquire error was
> prefixed by "(22)Invalid argument" rather than "(43)Identifier removed".
> This went on for 6 seconds before clearing itself up. The last bit gets
> interesting and I've added those log lines to the end of this post
> because it also relates to some other stuff I want to tell you about
first.
>
> Does this give me a hint that it's something to do with the SSL cache?
> Presumably one of the child processes that had gone wild was holding
> onto this mutex to produce this problem after the restart indicating
> that that is the point it had 'locked up'?
>
> See also my notes below about other SSL problems I've had.
>
> Speaking of the error log, here are the other errors (besides "File does
> not exist") that regularly appear in the logs:
>
> [Thu Apr 21 17:08:24 2005] [error] Re-negotiation handshake failed: Not
> accepted by client!?
>
> I get thousands of these (more than 50 an hour at peek times). SSL does
> work fine and so do our client certificates, our users are the kind that
> tell us if it's not working. I've tried to figure out what this means
> but all other posters on the net with this error get additional error
> messages with it. All I get is that one error over and over again.
>
> Really the only other error that fills the error_log is something along
> the lines of this:
>
> [Wed Jan 05 08:39:44 2005] [error] [client 219.89.127.102]
> Apache::Filter::print: (103) Software caused connection abort at
> /opt/motorweb/modperl/Motorweb/WOFFilter.pm line 159, referer:
> https://www.motorweb.co.nz:444/xxx
> Apache::Filter: (103) Software caused connection abort at -e line 0[Wed
> Jan 05 08:42:10 2005] [error] Re-negotiation handshake failed: Not
> accepted by client!?
>
> That's not my email formatting either. "Apache::Filter: (103)..." is on
> a new line but "[Wed Jan 05 08:42:10 2005] [error]..." is *not*. I guess
> something isn't adding newlines into the log properly. I've always
> assumed that this is simply the user refreshing/stopping their browser
> half way through a page being filtered. But I'm not sure. This "Software
> caused connection abort" in my perl filter happens a lot when this 100%
> CPU issue is happening. But that I presume is just because users are
> getting sick of not seeing anything and hitting refresh.
>
> One more thing about my setup. I use the java servlet container Resin
> which has an Apache module mod_caucho. This is a small and simple little
> module that passes requests off to a java based server on another
> machine. I don't think this is the problem because if the request has
> been parsed to the point where httpd can decide to use the handler
> caucho-request then I'd except to be seeing the request URL in the
> server-status output.
>
> Another issue relating to SSL and 443/444.
>
> I did previously strike what I thought may be a bug in httpd with
> running the two different SSL servers inside the one httpd instance. In
> some versions of IE, if a user goes to our https 443 site, but *does
> not* access any locations that use SSLVerifyClient, and then goes to the
> 444 site (which is SSLVerifyClient require) then this may cause the
> httpd process they connected to on 444 to go 100% CPU and never come
> back. The process would be like that until I killed it. If I turned on
> more verbose logging I'd see this:
>
> [Mon Mar 29 18:42:00 2004] [info] Connection to child 7 established
> (server www.motorweb.co.nz:443, client 172.20.20.106)
> [Mon Mar 29 18:42:00 2004] [info] Seeding PRNG with 0 bytes of entropy
> [Mon Mar 29 18:42:00 2004] [info] No acceptable peer certificate available
> [Mon Mar 29 18:42:00 2004] [info] Connection to child 7 closed with
> abortive shutdown(server www.motorweb.co.nz:443, client 172.20.20.106)
>
> If the user has been to a SSLVerifyClient require location on 443 first
> before going to 444 then the "No acceptable peer certificate" doesn't
> happen and the httpd process doesn't go mad.
>
> This only happened with specific copies of IE in our office. Even the
> exact same version number may cause it on one computer, but not on
> another. Very weird indeed.
>
> The problem I described up the top is different to this. In the way that
> the CPU usage seems to jump from process to process, where as with this
> 443/444 issue it just stuck on one process. But it gets me to wondering.
> Would it be worth me separating out the configuration of the server on
> 444 (as it shares very little configuration with the others) and run it
> in a separate httpd instance? I presume running a separate httpd
> instance is no problem?
>
> If you've made it this far, thank you.
>
> So, is it an SSL cache problem, another kind of SSL issue, something
> wrong with my perl script, a lack of memory, a DoS attack??? I just
> can't figure it out. As you can see, I've done a lot of investigation on
> this and am stumped. Any ideas, even suggestions on places I should look
> would help.
>
> thanks,
> Damon.
>
>
> As I promised above, here's the tail bit of the log when I had all the
> mutex errors after a the graceful restart:
>
> [Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to
> acquire global mutex lock
> [Mon Apr 04 09:22:58 2005] [error] (13)Permission denied: Cannot open
> SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for writing (store)
> [Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to
> release global mutex lock
> [Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to
> acquire global mutex lock
> [Mon Apr 04 09:22:58 2005] [error] (13)Permission denied: Cannot open
> SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for writing (store)
> [Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to
> release global mutex lock
> [Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to
> acquire global mutex lock
> [Mon Apr 04 09:22:58 2005] [error] (13)Permission denied: Cannot open
> SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for writing (store)
> [Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to
> release global mutex lock
> [Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to
> acquire global mutex lock
> [Mon Apr 04 09:22:58 2005] [error] (13)Permission denied: Cannot open
> SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for writing (store)
> [Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to
> release global mutex lock
> [Mon Apr 04 09:22:59 2005] [error] Cannot re-open SSLSessionCache DBM
> file `/var/log/httpd/ssl_cache' for expiring
> [Mon Apr 04 09:22:59 2005] [warn] (22)Invalid argument: Failed to
> release global mutex lock
> [Mon Apr 04 09:22:59 2005] [warn] (22)Invalid argument: Failed to
> acquire global mutex lock
> [Mon Apr 04 09:22:59 2005] [error] (13)Permission denied: Cannot open
> SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for reading (fetch)
> [Mon Apr 04 09:22:59 2005] [warn] (22)Invalid argument: Failed to
> release global mutex lock
> Apache::Filter: (103) Software caused connection abort at -e line 0[Mon
> Apr 04 09:23:00 2005] [notice] Apache/2.0.53 configured -- resuming
> normal operations
> [Mon Apr 04 09:23:01 2005] [warn] (43)Identifier removed: Failed to
> acquire global mutex lock
> [Mon Apr 04 09:23:01 2005] [warn] (43)Identifier removed: Failed to
> release global mutex lock
> [Mon Apr 04 09:23:01 2005] [error] [client 222.152.162.185]
> Apache::Filter::print: (103) Software caused connection abort at
> /opt/motorweb/modperl/Motorweb/WOFFilter.pm line 159, referer:
>
https://www.motorweb.co.nz:444/wofonline/wfpro/default.asp?category=wfpro&se
rvice=record_express
> Apache::Filter: (103) Software caused connection abort at -e line 0[Mon
> Apr 04 09:23:03 2005] [warn] (43)Identifier removed: Failed to acquire
> global mutex lock
> [Mon Apr 04 09:23:03 2005] [warn] (43)Identifier removed: Failed to
> release global mutex lock
> [Mon Apr 04 09:23:03 2005] [warn] (43)Identifier removed: Failed to
> acquire global mutex lock
> [Mon Apr 04 09:23:03 2005] [warn] (43)Identifier removed: Failed to
> release global mutex lock
>
> And then things carried on normally.
>
> ---------------------------------------------------------------------
> The official User-To-User support forum of the Apache HTTP Server Project.
> See <URL:http://httpd.apache.org/userslist.html> for more info.
> To unsubscribe, e-mail: users-unsubscribe@xxxxxxxxxxxxxxxx
>    "   from the digest: users-digest-unsubscribe@xxxxxxxxxxxxxxxx
> For additional commands, e-mail: users-help@xxxxxxxxxxxxxxxx
>

---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.
To unsubscribe, e-mail: users-unsubscribe@xxxxxxxxxxxxxxxx
   "   from the digest: users-digest-unsubscribe@xxxxxxxxxxxxxxxx
For additional commands, e-mail: users-help@xxxxxxxxxxxxxxxx