Re: Hung thread


 



Have you looked at installing the Apache server-status code (mod_status) so you can see what the last request was on each of these hung threads?

Alternatively, if you have something like mod_perl installed, one thing you can do is add a handler that warns the PID and request to the error log at the start and end of each request (with an appropriate tag). You can then look at the history of the hung threads to see if there is anything consistent about them.
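
Once such start/end lines are in the error log, something has to pair them up. A minimal sketch in Python, assuming a made-up tag format (the `REQTRACE START/END pid=... uri=...` layout and the function name are hypothetical, not anything Apache or mod_perl emits by default). It reports each PID whose last request never logged an END, along with the few requests that preceded it:

```python
import re
from collections import defaultdict

# Hypothetical tag format written by the start/end handlers (an assumption):
#   REQTRACE START pid=6100 uri=/cgi-bin/rev.cgi
#   REQTRACE END pid=6100 uri=/cgi-bin/rev.cgi
TRACE_RE = re.compile(r"REQTRACE (START|END) pid=(\d+) uri=(\S+)")

def find_unfinished(log_lines, history=3):
    """Return {pid: recent URIs} for PIDs whose last request never ended.

    The final URI in each list is the request that is still open; the
    ones before it are the requests that PID served just beforehand.
    """
    recent = defaultdict(list)   # pid -> URIs started, oldest first
    open_req = {}                # pid -> URI started but not yet ended
    for line in log_lines:
        m = TRACE_RE.search(line)
        if not m:
            continue             # ordinary error-log line, ignore it
        kind, pid, uri = m.group(1), int(m.group(2)), m.group(3)
        if kind == "START":
            recent[pid].append(uri)
            open_req[pid] = uri
        else:
            open_req.pop(pid, None)
    return {pid: recent[pid][-history:] for pid in open_req}
```

Keeping a short history per PID matters here because the hang may depend on the request before the one that is stuck.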

In the past I've had threads hang on the request that follows a particular request, or under a particular set of circumstances for a particular request (an infinite loop or something similar).

HTH

James

On 17/08/2015 20:18, Mark Jacquet wrote:
Jeff/Community

Getting back to this thread after a long time. We have tried many things since the initial issue: moved to Linux, tried the latest Apache/APR/APR-util binaries, tried adjusting the configuration, etc. All of this eventually failed in the same way: multiple hung threads eventually overloading the server.

In our current environment we switched to pre-fork mpm thinking that maybe threading was killing us. This seemed to work well until day 20 (which seems to be relevant as we got to day 20 a few times). Today all 200 procs (Max Servers) were launched, not one would die. All hung.

The root proc is in this state:

$sudo pstack 5362
#0  0x00000039892e1353 in __select_nocancel () from /lib64/libc.so.6
#1  0x00007ffff7989025 in apr_sleep () from /codeadm/http_servers/httpd-2.4.16-prefork/lib/libapr-1.so.0
#2  0x00000000004325ec in ap_wait_or_timeout ()
#3  0x0000000000469680 in prefork_run ()
#4  0x000000000043171e in ap_run_mpm ()
#5  0x000000000042b9e4 in main ()

Typical pstack from a hung proc is

$ sudo pstack 6100
#0  0x00007ffff7dd4955 in move_block () from /codeadm/http_servers/httpd-2.4.16-prefork/lib/libaprutil-1.so.0
#1  0x00007ffff7dd50a1 in apr_rmm_calloc () from /codeadm/http_servers/httpd-2.4.16-prefork/lib/libaprutil-1.so.0
#2  0x00007ffff5f26c66 in util_ald_strdup () from /codeadm/http_servers/httpd/modules/mod_ldap.so
#3  0x00007ffff5f2628a in util_ldap_search_node_copy () from /codeadm/http_servers/httpd/modules/mod_ldap.so
#4  0x00007ffff5f27235 in util_ald_cache_insert () from /codeadm/http_servers/httpd/modules/mod_ldap.so
#5  0x00007ffff5f2352d in uldap_cache_checkuserid () from /codeadm/http_servers/httpd/modules/mod_ldap.so
#6  0x00007ffff6b459ae in authn_ldap_check_password () from /codeadm/http_servers/httpd/modules/mod_authnz_ldap.so
#7  0x00007ffff673ae4f in authenticate_basic_user () from /codeadm/http_servers/httpd/modules/mod_auth_basic.so
#8  0x0000000000441c90 in ap_run_check_user_id ()
#9  0x00000000004451d2 in ap_process_request_internal ()
#10 0x00000000004627d8 in ap_process_async_request ()
#11 0x000000000046294f in ap_process_request ()
#12 0x000000000045ec9e in ap_process_http_connection ()
#13 0x00000000004567f0 in ap_run_process_connection ()
#14 0x000000000046900e in child_main ()
#15 0x0000000000469264 in make_child ()
#16 0x0000000000469d87 in prefork_run ()
#17 0x000000000043171e in ap_run_mpm ()
#18 0x000000000042b9e4 in main ()
[jacquet@llbdub0009 logs]$
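
With 200 hung procs it may be worth checking whether they are all stuck in the same place. A rough sketch in Python, assuming you have already captured the text of `pstack <pid>` for each child into a dict keyed by PID (the function names here are hypothetical). It groups the dumps by their innermost frames:

```python
import re
from collections import Counter

# Matches pstack/gdb-style frame lines such as:
#   #0  0x00007ffff7dd4955 in move_block () from /path/libaprutil-1.so.0
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-fA-F]+ in (\S+) \(")

def frames(pstack_text):
    """Return the function names from one pstack dump, innermost first."""
    names = []
    for line in pstack_text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            names.append(m.group(2))
    return names

def summarize(dumps, depth=3):
    """Count how many dumps share the same innermost `depth` frames.

    dumps maps a PID to the captured pstack text for that process.
    """
    return Counter(tuple(frames(text)[:depth]) for text in dumps.values())
```

If nearly every child shows `move_block` / `apr_rmm_calloc` at the top, that points at the shared LDAP cache rather than at any one request.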

Running on Red Hat Enterprise Linux Server release 6.6 (Santiago) with httpd-2.4.16-prefork.

Killing off these hung procs only band-aids the situation. New procs also hang (building up slowly now).
I am going to have to do a full restart of the server.
My expectation is that the server will be fine again for another 20 days.

Grasping at straws now. Any thoughts on this? Anything to try?

Thanks
Mj





On Thursday, June 18, 2015 7:56 AM, Jeff Trawick <trawick@xxxxxxxxx> wrote:


On Wed, Jun 17, 2015 at 8:51 PM, Mark Jacquet <mark_jacquet@xxxxxxxxx.invalid> wrote:
Just another oddity to add to the issue.

Overnight several more hung threads appeared and the load on the system had jumped into the mid 20's.
After killing these, the load did not drop. Looking at the list of running processes, I found httpds running, spawned from the original root httpd process, that *were not even displayed* in the scoreboard!!  After killing these hidden zombies off, the load dropped again.

What's common about the processes?  Similar backtrace to the first one posted?


 

So now I have to catch and kill two types: Zombies on the scoreboard and hidden zombies.
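
Finding the hidden ones amounts to a set difference between what the OS reports and what the scoreboard lists. A small sketch in Python, assuming the process list comes from `ps -e -o pid,ppid,comm` and that the scoreboard PIDs have already been collected into a set by some other means (the function name and both inputs are illustrative assumptions):

```python
def hidden_children(ps_text, root_pid, scoreboard_pids):
    """Return httpd PIDs parented by root_pid that the scoreboard omits.

    ps_text is assumed to be the output of `ps -e -o pid,ppid,comm`.
    scoreboard_pids is a set of PIDs taken from the server-status page.
    """
    hidden = []
    for line in ps_text.splitlines():
        parts = line.split(None, 2)
        # Skip the header row and anything that is not "pid ppid command".
        if len(parts) < 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue
        pid, ppid, comm = int(parts[0]), int(parts[1]), parts[2].strip()
        if ppid == root_pid and comm.startswith("httpd") and pid not in scoreboard_pids:
            hidden.append(pid)
    return sorted(hidden)
```

The result is only a list of candidates to inspect; a child that started between the two snapshots would also show up here.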

And this is cute. Sometimes the zombies hang around so long that when the system gets back to creating a new process for slot #1, if the zombie was originally in that slot it is displayed there along with its brothers for the new process:


"scoreboard squatting"


e.g. Note process 19597 below

1-0166310/33/1320_ 131.22202255280.01.6035.79 10.172.91.217newyahoo.oak.sap.corp:80NULL
1-0166310/18/1087_ 105.88340736980.00.6926.65 10.172.240.113www-dse.oak.sap.corp:80GET /cgi-bin/websql/websql.dir/QTS/bugsheetcont.hts?bugid=74133
1-0166310/11/1178_ 76.49589542980.00.5634.78 10.172.91.92newyahoo.oak.sap.corp:80NULL
1-0166310/32/1295_ 92.17425417130.04.0342.07 10.172.240.113newyahoo.oak.sap.corp:80NULL
1-0195970/26/1319W 35.552441700.00.5437.10 10.172.248.87www-rev.oak.sap.corp:80GET /cgi-bin/rev.cgi?action="" HTTP/1.1
1-0166310/12/1427_ 18.41794100.00.14238.52 10.172.240.113newyahoo.oak.sap.corp:80NULL
1-0166310/27/1442_ 30.67719695430.00.7835.07 10.172.85.9newyahoo.oak.sap.corp:80NULL
1-0166310/19/784_ 10.70940630.00.4520.95 10.172.246.203newyahoo.oak.sap.corp:80NULL
1-0166310/8/1034_ 2.86103144630.00.0124.04 10.172.90.155newyahoo.oak.sap.corp:80NULL
2-0-0/0/99. 58.943145013820.00.002.15 10.136.66.135newyahoo.oak.sap.corp:80NULL
2-0-0/0/82. 2181.923144824390.00.001.48 10.162.65.165www-dse.oak.sap.corp:80POST /cgi-bin/websql/websql.dir/QTS/bugsescalated.pl?product=AN
2-0-0/0/162. 2027.12314509350.00.003.36 10.50.3.99newyahoo.oak.sap.corp:80NULL
2-0-0/0/576. 1704.40314504100.00.0013.38 10.172.240.113newyahoo.oak.sap.corp:80NULL
2-0-0/0/928. 1295.363145029750.00.0024.38 10.50.17.221newyahoo.oak.sap.corp:80NULL
2-0-0/0/852. 1798.52314503810.00.0020.72 10.162.65.165newyahoo.oak.sap.corp:80NULL
2-0-0/0/1084. 551.293145022210.00.0026.52 10.176.138.162newyahoo.oak.sap.corp:80NULL
2-0-0/0/1180. 385.833145019630.00.0034.31 10.162.65.197newyahoo.oak.sap.corp:80NULL
2-0-0/0/50. 50.713145000.00.001.62 10.58.181.166www-rev.oak.sap.corp:80GET /cgi-bin/rev.cgi?action="" HTTP/1.1
2-0137610/12/1078W 58.803489600.00.1031.67 10.172.107.38www-rev.oak.sap.corp:80POST /cgi-bin/rev.cgi HTTP/1.1
2-0-0/0/1075. 1061.5331450790.00.0031.65 10.172.90.88newyahoo.oak.sap.corp:80GET /server-status HTTP/1.1
2-0-0/0/1362. 46.803145080.00.0039.72 10.172.107.38www-rev.oak.sap.corp:80POST /cgi-bin/rev.cgi HTTP/1.1
2-0-0/0/1142. 56.693145011490.00.0035.22 10.172.240.113newyahoo.oak.sap.corp:80NUL
Slot #2 currently not being used (still has zombie)

MJ








On Tuesday, June 16, 2015 5:42 PM, Mark Jacquet <mark_jacquet@xxxxxxxxx.INVALID> wrote:


Upgrade as in Apache upgrade or Solaris 5.10 patch upgrade? :)

Apache is all new, of course: 2.4.12 with the latest add-on sources (apr, pcre, etc.).
The bad news is that the OS is not at all up to date. And for reasons I have no control over, I cannot patch.
So if this is an OS issue then ......

I seem to be running with the Sun native LDAP SDK. Would building against a different LDAP source (OpenLDAP) help?

Long term plan -> moving all Apache servers to Linux

Mj



On Tuesday, June 16, 2015 5:31 PM, Eric Covener <covener@xxxxxxxxx> wrote:


On Tue, Jun 16, 2015 at 8:23 PM, Mark Jacquet

<mark_jacquet@xxxxxxxxx.invalid> wrote:
> So do you think this hang is related to the native LDAP lib code?


It is possible but IMO not very likely. It has to corrupt memory just
enough to put a looping structure in apr_rmm.  What's your upgrade
history like?
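
To make the "looping structure" idea concrete: if an allocator walks a chain of blocks linked by offsets, a single overwritten "next" field that points back at an earlier block makes the walk spin forever, which would look like a process stuck in a frame such as move_block. The sketch below is a Python illustration only and is not apr_rmm's code; the dict-based chain, the use of offset 0 as the end marker, and the function name are all assumptions:

```python
def has_cycle(next_of, head):
    """Floyd's tortoise-and-hare check on an offset-linked chain.

    next_of maps a block offset to the next block's offset.
    Offset 0 (or a missing entry) marks the end of the chain.
    """
    def step(off):
        # Follow one link; stay at 0 once the end has been reached.
        return next_of.get(off, 0) if off else 0

    slow = fast = head
    while fast:
        fast = step(step(fast))   # hare moves two links
        slow = step(slow)         # tortoise moves one link
        if fast and fast == slow:
            return True           # the two met inside a loop
    return False                  # hare reached the end marker

healthy = {64: 128, 128: 256, 256: 0}
corrupt = {64: 128, 128: 256, 256: 128}   # last link points backwards
```

A naive `while off: off = next_of[off]` walk terminates on `healthy` but never returns on `corrupt`, while `has_cycle` detects the loop without hanging.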

--
Eric Covener
covener@xxxxxxxxx










--
Born in Roswell... married an alien...
http://emptyhammock.com/





