i think about; "garbage collection".. its the releasing of unused vars and values, and recovering the memory space. but why in this way? ----- Original Message ----- From: "Damon Maria" <damon@xxxxxxxxxxxxxx> To: <users@xxxxxxxxxxxxxxxx> Sent: Thursday, April 21, 2005 2:31 PM Subject: [users@httpd] After a couple days httpd child processes go 100% CPU > Sorry for the not very descriptive subject, but this is a hard to > describe problem... > > Some basic details: > - httpd-2.0.54 (although I've had this problem since at least 2.0.49) > - Redhat 9.0 (kernel 2.4.20-8) > - 2.4GHz P4 256M mem. > > After some time (usually a couple of days) of running fine and averaging > about 0.5% CPU usage httpd suddenly goes mad. A particular child process > will be at close to 100% CPU usage. It will stay like that for a couple > of seconds before returning to normal CPU usage as another child process > goes to 100% CPU for a couple seconds. And so on and so on. > > A server restart fixes the problem. Once it took 2 restarts. Even a > graceful restart will do it but results in issues with a global mutex > lock related to the SSLSessionCache which hasn't been released. > > I added in a cronjob to restart the server every night at 5:00am. That > seems to hold the problem at bay for a while but eventually it happens > within the 24 hours between restarts (this time it was about 11 hours > after the cronjob restart). So the length of time it takes to show up > the problem isn't constant. > > This issue has been around for quite a few months now. At first it only > happened occasionally but now it's now getting more and more frequent - > I guess tracking with the increase of traffic on our site. I tried > reverting back to 2.0.49 but that made no difference. I've since > upgraded with each release but still have the problem. This site has > been running on apache 2 for a couple of years now. > > I've written a script that continually dumps out the result of the > server-status handler with ExtendedStatus On and runs top in batch mode > piping it out to a log. Last time the issue happened I ran this script > so I'd have a record of what CPU usage there was per process > second-by-second as well as what those processes were doing according to > httpd. > > The output of top shows only one httpd child process is ever running > 100% CPU, and it changes to a different child process every couple of > seconds. The line from server-status for that process always looks the > same, something like: > > Srv PID Acc M CPU SS Req Conn Child Slot Client VHost Request > 68-1 22174 0/1/1 R 0.00 73 578 0.0 0.01 0.01 ? ? ..reading.. > > As you can see, it has used no CPU and has been sitting there for 73 > seconds handling this request (SS). If I trace this process (68-1 22174) > through page after page of output from server-status the SS value > (Seconds since beginning of most recent request) increases but none of > the other stats change. Strange: the process was listed as close to 100% > CPU for a couple of seconds by top but this doesn't show up in any of > the stats from server-status. > > As the client, vhost and request are never shown for these runaway > processes I can't even figure out in which area of our configuration > these requests are hitting. > > By the time I was notified to the problem and got the script running > there were already ~120 children looking like this and only 7 children > listed with any request information. From the header of server-status: > > RRRRRRRRRRRRRRRWRRRRRRRRRRCRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR > RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRKRRRRRRRRRRRRRRRRRRRWWR___RR > RR_____......................................................... > ................................................................ > > Also from the log of top I can see that overall the machine is 0% idle, > and 0% io wait, but about 85% system and only 15% user. Although I'm not > sure what (if anything) that means. The memory is close to full but only > 8M of swap is being used, so I don't think that's a problem. > > I'd like to run through an aspect of our setup that is not so ordinary, > as this may give someone a clue as to what's going on. We run SSL on two > ports: 443 and 444. This is to get around this bug: > > http://issues.apache.org/bugzilla/show_bug.cgi?id=12355 > > Some URLs on 443 require a client certificate, and these always use GET > rather than POST (because of the above bug). We also proxy another site > (that is out of our control) which we must secure with client > certificates. As this site uses POST we have to put this onto a virtual > host on a different port so we can specify SSLVerifyClient require for > that entire <virtualhost>, rather than for particular <location>s. > > Our (truncated) SSL setup: > > SSLSessionCache dbm:/var/log/httpd/ssl_cache > SSLSessionCacheTimeout 600 > SSLVerifyDepth 1 > SSLCACertificateFile XXXXXXX > SSLVerifyClient none > SSLCipherSuite > ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP:+eNULL > SSLCertificateFile XXXXX > SSLCertificateKeyFile XXXXX > SSLCertificateChainFile XXXXX > <VirtualHost *:443> > ServerName www.motorweb.co.nz > SSLEngine on > <LocationMatch "^/YYYY/ZZZZ"> > SSLVerifyClient require > SSLOptions +OptRenegotiate +StdEnvVars > </LocationMatch> > ... several other <location>s like the one above > </VirtualHost> > <VirtualHost *:444> > ServerName www.motorweb.co.nz > SSLEngine on > SSLVerifyClient require > SSLOptions +OptRenegotiate +StdEnvVars > </VirtualHost> > > Side note: we used to have only a single IP address which is why we used > another port rather than a different IP for the proxied site. We now > have multiple IPs and so will be moving this proxying onto a different > IP on port 443 at some stage. > > To make our configuration even more complex we also have to > screen-scrape the pages being returned by the proxied site. To do this I > wrote a perl module that uses PerlOutputFilterHandler (although, to be > honest, I know close to nothing about perl). > > Our perl setup (using mod_perl-2.0.0-RC4): > > LoadModule perl_module modules/mod_perl.so > PerlSwitches -I/opt/motorweb/modperl > PerlModule Apache2 > PerlModule Motorweb::WOFFilter > <VirtualHost *:444> > ... > <Location /abc/> > ProxyPass http://XXXXX/ > ProxyPassReverse http://XXXXX/ > PerlOutputFilterHandler Motorweb::WOFFilter > </Location> > </VirtualHost> > > This perl module had been running fine for about 6 months before this > issue turned up, and hasn't been changed since. So I don't think that > it's the problem. > > I have suspected it may be something to do with SSL as once when this > happened I could access a page fine through http:// but the same page > through https:// wouldn't return anything (browser would timeout). > However, on other occasions even http:// doesn't work (although this may > be because there just isn't the CPU time available). > > Our SSLSessionCache files also get really big (ssl_cache.pag is > currently around 500MB), and while this issue was happening one time I > checked and the ssl_cache.dir was suspiciously exactly 65536 bytes. But > it gets recreated on a restart and always comes back to the same size, > so I guess they're supposed to be that big? Is there a limit on the size > of the cache or their files? I've also noticed the occasional line like > this in the error_log (about once a day, sometimes more, sometimes less): > > [Wed Apr 20 16:00:18 2005] [error] (20014)Error string not specified > yet: Cannot store SSL session to DBM file `/var/log/httpd/ssl_cache' > > Also remember my note up the top about when I tried a graceful restart. > I guess the runaway child processes were still hanging around and > holding onto a global mutex lock to the SSLSessionCache because I got > the following lines in the error log: > > Mon Apr 04 09:22:54 2005] [notice] Graceful restart requested, doing restart > [Mon Apr 04 09:22:57 2005] [warn] (43)Identifier removed: Failed to > acquire global mutex lock > [Mon Apr 04 09:22:57 2005] [error] (13)Permission denied: Cannot open > SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for scanning > [Mon Apr 04 09:22:57 2005] [warn] (22)Invalid argument: Failed to > release global mutex lock > > And then the same last three lines repeated over and over and over. > Although usually "for scanning" was "for reading", but became almost > entirely "for writing" near the end, and sometimes the acquire error was > prefixed by "(22)Invalid argument" rather than "(43)Identifier removed". > This went on for 6 seconds before clearing itself up. The last bit gets > interesting and I've added those log lines to the end of this post > because it also relates to some other stuff I want to tell you about first. > > Does this give me a hint that it's something to do with the SSL cache? > Presumably one of the child processes that had gone wild was holding > onto this mutex to produce this problem after the restart indicating > that that is the point it had 'locked up'? > > See also my notes below about other SSL problems I've had. > > Speaking of the error log, here are the other errors (besides "File does > not exist") that regularly appear in the logs: > > [Thu Apr 21 17:08:24 2005] [error] Re-negotiation handshake failed: Not > accepted by client!? > > I get thousands of these (more than 50 an hour at peek times). SSL does > work fine and so do our client certificates, our users are the kind that > tell us if it's not working. I've tried to figure out what this means > but all other posters on the net with this error get additional error > messages with it. All I get is that one error over and over again. > > Really the only other error that fills the error_log is something along > the lines of this: > > [Wed Jan 05 08:39:44 2005] [error] [client 219.89.127.102] > Apache::Filter::print: (103) Software caused connection abort at > /opt/motorweb/modperl/Motorweb/WOFFilter.pm line 159, referer: > https://www.motorweb.co.nz:444/xxx > Apache::Filter: (103) Software caused connection abort at -e line 0[Wed > Jan 05 08:42:10 2005] [error] Re-negotiation handshake failed: Not > accepted by client!? > > That's not my email formatting either. "Apache::Filter: (103)..." is on > a new line but "[Wed Jan 05 08:42:10 2005] [error]..." is *not*. I guess > something isn't adding newlines into the log properly. I've always > assumed that this is simply the user refreshing/stopping their browser > half way through a page being filtered. But I'm not sure. This "Software > caused connection abort" in my perl filter happens a lot when this 100% > CPU issue is happening. But that I presume is just because users are > getting sick of not seeing anything and hitting refresh. > > One more thing about my setup. I use the java servlet container Resin > which has an Apache module mod_caucho. This is a small and simple little > module that passes requests off to a java based server on another > machine. I don't think this is the problem because if the request has > been parsed to the point where httpd can decide to use the handler > caucho-request then I'd except to be seeing the request URL in the > server-status output. > > Another issue relating to SSL and 443/444. > > I did previously strike what I thought may be a bug in httpd with > running the two different SSL servers inside the one httpd instance. In > some versions of IE, if a user goes to our https 443 site, but *does > not* access any locations that use SSLVerifyClient, and then goes to the > 444 site (which is SSLVerifyClient require) then this may cause the > httpd process they connected to on 444 to go 100% CPU and never come > back. The process would be like that until I killed it. If I turned on > more verbose logging I'd see this: > > [Mon Mar 29 18:42:00 2004] [info] Connection to child 7 established > (server www.motorweb.co.nz:443, client 172.20.20.106) > [Mon Mar 29 18:42:00 2004] [info] Seeding PRNG with 0 bytes of entropy > [Mon Mar 29 18:42:00 2004] [info] No acceptable peer certificate available > [Mon Mar 29 18:42:00 2004] [info] Connection to child 7 closed with > abortive shutdown(server www.motorweb.co.nz:443, client 172.20.20.106) > > If the user has been to a SSLVerifyClient require location on 443 first > before going to 444 then the "No acceptable peer certificate" doesn't > happen and the httpd process doesn't go mad. > > This only happened with specific copies of IE in our office. Even the > exact same version number may cause it on one computer, but not on > another. Very weird indeed. > > The problem I described up the top is different to this. In the way that > the CPU usage seems to jump from process to process, where as with this > 443/444 issue it just stuck on one process. But it gets me to wondering. > Would it be worth me separating out the configuration of the server on > 444 (as it shares very little configuration with the others) and run it > in a separate httpd instance? I presume running a separate httpd > instance is no problem? > > If you've made it this far, thank you. > > So, is it an SSL cache problem, another kind of SSL issue, something > wrong with my perl script, a lack of memory, a DoS attack??? I just > can't figure it out. As you can see, I've done a lot of investigation on > this and am stumped. Any ideas, even suggestions on places I should look > would help. > > thanks, > Damon. > > > As I promised above, here's the tail bit of the log when I had all the > mutex errors after a the graceful restart: > > [Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to > acquire global mutex lock > [Mon Apr 04 09:22:58 2005] [error] (13)Permission denied: Cannot open > SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for writing (store) > [Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to > release global mutex lock > [Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to > acquire global mutex lock > [Mon Apr 04 09:22:58 2005] [error] (13)Permission denied: Cannot open > SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for writing (store) > [Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to > release global mutex lock > [Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to > acquire global mutex lock > [Mon Apr 04 09:22:58 2005] [error] (13)Permission denied: Cannot open > SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for writing (store) > [Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to > release global mutex lock > [Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to > acquire global mutex lock > [Mon Apr 04 09:22:58 2005] [error] (13)Permission denied: Cannot open > SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for writing (store) > [Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to > release global mutex lock > [Mon Apr 04 09:22:59 2005] [error] Cannot re-open SSLSessionCache DBM > file `/var/log/httpd/ssl_cache' for expiring > [Mon Apr 04 09:22:59 2005] [warn] (22)Invalid argument: Failed to > release global mutex lock > [Mon Apr 04 09:22:59 2005] [warn] (22)Invalid argument: Failed to > acquire global mutex lock > [Mon Apr 04 09:22:59 2005] [error] (13)Permission denied: Cannot open > SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for reading (fetch) > [Mon Apr 04 09:22:59 2005] [warn] (22)Invalid argument: Failed to > release global mutex lock > Apache::Filter: (103) Software caused connection abort at -e line 0[Mon > Apr 04 09:23:00 2005] [notice] Apache/2.0.53 configured -- resuming > normal operations > [Mon Apr 04 09:23:01 2005] [warn] (43)Identifier removed: Failed to > acquire global mutex lock > [Mon Apr 04 09:23:01 2005] [warn] (43)Identifier removed: Failed to > release global mutex lock > [Mon Apr 04 09:23:01 2005] [error] [client 222.152.162.185] > Apache::Filter::print: (103) Software caused connection abort at > /opt/motorweb/modperl/Motorweb/WOFFilter.pm line 159, referer: > https://www.motorweb.co.nz:444/wofonline/wfpro/default.asp?category=wfpro&se rvice=record_express > Apache::Filter: (103) Software caused connection abort at -e line 0[Mon > Apr 04 09:23:03 2005] [warn] (43)Identifier removed: Failed to acquire > global mutex lock > [Mon Apr 04 09:23:03 2005] [warn] (43)Identifier removed: Failed to > release global mutex lock > [Mon Apr 04 09:23:03 2005] [warn] (43)Identifier removed: Failed to > acquire global mutex lock > [Mon Apr 04 09:23:03 2005] [warn] (43)Identifier removed: Failed to > release global mutex lock > > And then things carried on normally. > > --------------------------------------------------------------------- > The official User-To-User support forum of the Apache HTTP Server Project. > See <URL:http://httpd.apache.org/userslist.html> for more info. > To unsubscribe, e-mail: users-unsubscribe@xxxxxxxxxxxxxxxx > " from the digest: users-digest-unsubscribe@xxxxxxxxxxxxxxxx > For additional commands, e-mail: users-help@xxxxxxxxxxxxxxxx > --------------------------------------------------------------------- The official User-To-User support forum of the Apache HTTP Server Project. See <URL:http://httpd.apache.org/userslist.html> for more info. To unsubscribe, e-mail: users-unsubscribe@xxxxxxxxxxxxxxxx " from the digest: users-digest-unsubscribe@xxxxxxxxxxxxxxxx For additional commands, e-mail: users-help@xxxxxxxxxxxxxxxx