Re: Hung thread

I just ran a test and killed 4 of the 6 processes that had multiple threads stuck in the same place.
After each kill the "W"s went away (procs gone from the scoreboard) and the load went down. The good news is that the server stayed up and seems to be running fine.
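
For anyone tracking the same symptom between kills, here is a minimal sketch of tallying scoreboard states. It assumes the scoreboard has already been copied out of the server-status page as plain text; the function name `tally_scoreboard` is made up for illustration and is not part of Apache.

```python
from collections import Counter


def tally_scoreboard(scoreboard):
    """Count each worker-slot state character in a pasted scoreboard.

    Line breaks and other whitespace are ignored, so the multi-line
    block from the status page can be passed in unchanged.
    """
    return Counter(ch for ch in scoreboard if not ch.isspace())


if __name__ == "__main__":
    # Hypothetical sample, not the real scoreboard from this server.
    sample = "__WW_..W\n.W_"
    counts = tally_scoreboard(sample)
    print("W:", counts["W"], "idle:", counts["_"], "open:", counts["."])
```

Comparing the "W" count before and after each kill shows whether the stuck slots are really being released.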

So do you think this hang is related to the native LDAP lib code?

I built Apache/APR using:

[Mon Jun 08 14:30:49.297984 2015] [ldap:info] [pid 2604:tid 1] AH01318: APR LDAP: Built with Sun Microsystems Inc. LDAP SDK

I could download a different LDAP SDK (OpenLDAP?) and rebuild against that.

MJ



On Tuesday, June 16, 2015 4:45 PM, Jeff Trawick <trawick@xxxxxxxxx> wrote:



On Jun 16, 2015 18:26, "Mark Jacquet" <mark_jacquet@xxxxxxxxx.invalid> wrote:
>
> I am seeing something very odd on our Apache 2.4.12 server  (SunOS myhostname 5.10 Generic_118833-36 sun4v sparc SUNW,Sun-Fire-T200)
> We are using MPM Worker.
>
> I have been watching the scoreboard all day monitoring system load and running processes/threads.
> Around 10AM the load jumped from a normal <1 to >7, then made its way up to >20, where it has sat all day with 21 threads in status "W".
> I traced the threads back to the actual users here at work and asked them what they did, etc. No help there, other than that they both made requests to the server in rapid succession (one "restored" a browser session, the other rapidly clicked some URLs in a Word doc). One user even rebooted for me (no effect on Apache).
>
> In any case I have 21 threads in "W" state.
>
> The server has even gone on and created new processes, leaving these procs behind with one or more threads still active. But the load will not drop!
>
> The pstack output for one hung process (this one has only one hung thread) looks like this:
>
>
> 3260:   /codeadm/http_servers/httpd/bin/httpd -f /codeadm/http_servers/httpd/c
> -----------------  lwp# 1 / thread# 1  --------------------
>  ff041714 lwp_wait (10, ffbff2ec)
>  ff03d11c _thrp_join (10, 0, ffbff354, 1, ffbff2ec, ff06cbc0) + 34
>  ff24fd08 apr_thread_join (ffbff3d4, 1ef320, ff06cbc0, 0, 0, ff3a2000) + 48
>  000d4490 join_workers (1ef4a0, 1f4a88, 1, 1eef00, 1eee50, 1883d0) + 2f8
>  000d4e80 child_main (2, d1988, ff06cbc0, 0, 0, ff3a2000) + 7f8
>  000d50a8 make_child (1883d0, 2, 134518, 7, 0, 1883d0) + 1b0
>  000d5cb0 perform_idle_server_maintenance (ffbff69c, ffbff698, ffbff684, 163188, 1883d0, ff3a0140) + a28
>  000d6300 server_main_loop (0, 0, 134518, 7, 0, 1883d0) + 548
>  000d67e8 worker_run (134518, 18a470, 1883d0, 150000, ff3a0100, ff3a0140) + 490
>  0005dd28 ap_run_mpm (163188, 18a470, 1883d0, 1883d0, 0, 0) + a8
>  0004e0e0 main     (5, ffbff8cc, ffbff8e4, 150000, ff3a0100, ff3a0140) + 17b0
>  0004b3b4 _start   (0, 0, 0, 0, 0, 0) + dc
> -----------------  lwp# 16 / thread# 16  --------------------
>  ff31dcc4 find_block_by_offset (19c550, 10, d778, 1, 0, 314628) + 8c
>  ff31e218 move_block (19c550, d778, 0, 0, 2, 0) + 228
>  ff31f44c apr_rmm_calloc (19c550, 18, fe8e4af8, c, 0, 314628) + 1fc
>  fe8e07bc util_ald_alloc (fe580670, 18, 0, 0, 2, 0) + 7c
>  fe8e1f20 util_ald_cache_insert (fe580670, fd0f9898, fe8e4af8, c, 0, 314628) + 170
>  fe8d9d2c uldap_cache_checkuserid (fe8e4af8, 0, 0, 0, 2, 0) + 1044
>  fe9e3f74 authn_ldap_check_password (0, fd0f99ac, 31609f, fd0f9998, 80808080, 1010101) + 834
>  fe982470 authenticate_basic_user (314628, 0, 3145e8, 8d, 237120, 25aec0) + 608
>  0007f750 ap_run_check_user_id (314628, 236e78, 236e78, 2, d, 25aec0) + 90
>  000818fc ap_process_request_internal (314628, 0, 3145e8, 8d, 237120, 25aec0) + 6e4
>  000c5288 ap_process_async_request (314628, 236e78, 236e78, 2, d, 25aec0) + 638
>  000c5428 ap_process_request (314628, 4, 314628, 8d, 237120, 25aec0) + 20
>  000bddc0 ap_process_http_sync_connection (237128, 236e78, 236e78, 2, d, 25aec0) + f0
>  000bdfbc ap_process_http_connection (237128, 236e78, 236e78, 8d, 237120, 25aec0) + 64
>  000ab038 ap_run_process_connection (237128, 236e78, 236e78, 2, d, 25aec0) + 90
>  000ab9bc ap_process_connection (237128, 236e78, 236e78, 8d, 237120, 25aec0) + 8c
>  000d235c process_socket (1ef320, 236e30, 236e78, 2, d, 25aec0) + ec
>  000d373c worker_thread (1ef320, 1f6ef0, 0, 0, 0, 0) + 49c
>  ff24f894 dummy_worker (1ef320, fd0fc000, 0, 0, ff24f840, 1) + 54
>  ff0404f4 _lwp_start (0, 0, 0, 0, 0, 0)
> -----------------  lwp# 17 / thread# 17  --------------------
>  ff24f840 dummy_worker(), exit value = 0x00000000
>         ** zombie (exited, not detached, not yet joined) **
> -----------------  lwp# 18 / thread# 18  --------------------
>  ff24f840 dummy_worker(), exit value = 0x00000000
>         ** zombie (exited, not detached, not yet joined) **
> -----------------  lwp# 19 / thread# 19  --------------------
>  ff24f840 dummy_worker(), exit value = 0x00000000
>         ** zombie (exited, not detached, not yet joined) **
> -----------------  lwp# 20 / thread# 20  --------------------
>  ff24f840 dummy_worker(), exit value = 0x00000000
>         ** zombie (exited, not detached, not yet joined) **
> -----------------  lwp# 21 / thread# 21  --------------------
>  ff24f840 dummy_worker(), exit value = 0x00000000
>         ** zombie (exited, not detached, not yet joined) **
> -----------------  lwp# 22 / thread# 22  --------------------
>  ff24f840 dummy_worker(), exit value = 0x00000000
>         ** zombie (exited, not detached, not yet joined) **
> -----------------  lwp# 23 / thread# 23  --------------------
>  ff24f840 dummy_worker(), exit value = 0x00000000
>         ** zombie (exited, not detached, not yet joined) **
> -----------------  lwp# 24 / thread# 24  --------------------
>  ff24f840 dummy_worker(), exit value = 0x00000000
>         ** zombie (exited, not detached, not yet joined) **
> -----------------  lwp# 25 / thread# 25  --------------------
>  ff24f840 dummy_worker(), exit value = 0x00000000
>         ** zombie (exited, not detached, not yet joined) **
> -----------------  lwp# 26 / thread# 26  --------------------
>  ff24f840 dummy_worker(), exit value = 0x00000000
>         ** zombie (exited, not detached, not yet joined) **
> -----------------  lwp# 27 / thread# 27  --------------------
>  ff24f840 dummy_worker(), exit value = 0x00000000
>         ** zombie (exited, not detached, not yet joined) **
> newyahoo% 
>
>
>
> Partial ScoreBoard looks like:
>
> Server Version: Apache/2.4.12 (Unix)
> Server MPM: worker
> Server Built: Jun 3 2015 17:19:20
>
> Current Time: Tuesday, 16-Jun-2015 15:01:45 PDT
> Restart Time: Monday, 08-Jun-2015 14:30:49 PDT
> Parent Server Config. Generation: 1
> Parent Server MPM Generation: 0
> Server uptime: 8 days 30 minutes 55 seconds
> Server load: 23.09 22.46 21.88
> Total accesses: 68346 - Total Traffic: 10.0 GB
> CPU Usage: u97541.5 s126.35 cu787.35 cs139.55 - 14.2% CPU load
> .0986 requests/sec - 15.1 kB/second - 152.7 kB/request
> 6 requests currently being processed, 94 idle workers
>
> _____________WW_____W__W_____________W_____W______.............W
> ....................W..W.W.W...W...W..........W.......WW..W.....
> ..........W...W..W.W..__________________________________________
> ________
>
> Scoreboard Key:
> "_" Waiting for Connection, "S" Starting up, "R" Reading Request,
> "W" Sending Reply, "K" Keepalive (read), "D" DNS Lookup,
> "C" Closing connection, "L" Logging, "G" Gracefully finishing,
> "I" Idle cleanup of worker, "." Open slot with no current process
>
>
> netstat shows some hung connections in "CLOSE_WAIT" state for one (but not all) of the hosts that have hung threads/connections:
>
> newyahoo% netstat | grep clienthostname
> newyahoo.WWW         clienthostname.62580 65142      0 49896      0 CLOSE_WAIT
> newyahoo.WWW         clienthostname.62579 65142      0 49896      0 CLOSE_WAIT
> newyahoo.WWW         clienthostname.62582 65142      0 49896      0 CLOSE_WAIT
> newyahoo.WWW         clienthostname.62591 65142      0 49896      0 CLOSE_WAIT
>
>
> Can anyone assist in debugging this?
>
> I would love to have these threads exit without having to manually restart the server.
All threads with zombie status (all but 2) have already exited.  There is just #16 stuck in LDAP and the main thread waiting for it to exit.
I don't think that this process could account for more than one non-idle thread in the status display.
If the process is using CPU and the thread really has been stuck here for a while, then I guess the thread in LDAP is looping. Killing the process doesn't make things worse, but perhaps shared memory is already corrupted, and threads in other processes will be affected if they aren't already.  Be ready to restart if threads keep getting stuck in the same place.
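
The reading of the pstack output above (zombie threads have exited, leaving one worker plus the main thread) can be checked mechanically across several hung processes. This is a minimal sketch, assuming Solaris pstack text in the same layout as the quoted dump; the names `summarize_pstack` and `live_threads` are made up for illustration.

```python
import re

# Assumed layout: one header per thread, e.g.
# "-----------------  lwp# 16 / thread# 16  --------------------",
# followed by stack frames or a "** zombie ... **" note.
THREAD_HEADER = re.compile(r"-+\s+lwp#\s*(\d+)\s*/\s*thread#\s*(\d+)\s+-+")


def summarize_pstack(text):
    """Map each thread number to its top frame name, or to 'zombie'."""
    threads = {}
    current = None
    for raw in text.splitlines():
        # Drop mail quoting ("> ") and surrounding whitespace.
        line = raw.lstrip("> ").strip()
        match = THREAD_HEADER.search(line)
        if match:
            current = int(match.group(2))
            threads[current] = None
            continue
        if current is None or not line:
            continue
        if "zombie" in line:
            threads[current] = "zombie"
        elif threads[current] is None:
            # Frame lines look like: "ff041714 lwp_wait (10, ffbff2ec)"
            parts = line.split()
            if len(parts) >= 2:
                threads[current] = parts[1]
    return threads


def live_threads(threads):
    """Keep only threads that have not already exited."""
    return {tid: top for tid, top in threads.items() if top != "zombie"}
```

Running it over the pstack output of each suspect process and comparing the top frames of the live threads shows whether they are all stuck in the same place.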


>
> Thanks
> MJ
>
>
>
>


