Deadlock? 30s clone() leads to very long response times

Hello,

Some of my web frontends experience weird performance issues from time to time,
and after a lot of troubleshooting and research, I still have no clue what is
going on. I am writing to get hints on how to dig deeper into the issue, and
your thoughts on whether this may be a bug in Apache or in something else.

The setup is quite common: a few web frontends behind a load balancer.
All of them run Debian Squeeze and its packaged version of Apache2: 2.2.16,
using the prefork MPM and mostly serving PHP CGIs w/ SuPHP. The Kernel is
vanilla 2.6.32.59 patched with grsecurity.

The problem: all Apache frontends become extremely slow at the same time,
giving the impression that the entire cluster is down, although they still
answer requests correctly if the client keeps waiting (more than 1
minute...).

The load average is very high (>> 30), but CPU, memory and I/O usage are
normal. The output of 'apache2ctl status' and 'apache2ctl fullstatus' matches
what I would expect on a regular production day. MaxClients is not hit. No
iowait.

Most Apache2 processes are in the 'D' state at first, and eventually move to
the 'S' state after a while. The master process stays in the 'Ds' state.
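
To see what the 'D' processes are waiting on, something like the following
might help. It is only a minimal sketch, assuming the procps version of 'ps'
(which can print the WCHAN column, i.e. the kernel symbol a sleeping process
is blocked in); 'dstate_only' is a hypothetical helper name, not an existing
tool.

```shell
#!/bin/sh
# dstate_only: hypothetical helper. Reads 'ps' output on stdin and keeps only
# the rows whose second column (STAT) starts with 'D', i.e. processes in
# uninterruptible sleep. The header row is dropped because "STAT" does not
# start with 'D'.
dstate_only() {
  awk '$2 ~ /^D/ { print }'
}

# Intended use on an affected frontend (assumes procps ps):
#   ps -eo pid,stat,wchan:32,comm | dstate_only
```

If most rows show the same WCHAN value, that symbol is a reasonable starting
point for a kernel-side search. Where the kernel exposes it, /proc/PID/stack
(readable by root) gives the full kernel stack of a given process.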


The following tests were made after redirecting the incoming requests to
other frontends. The load went back to nearly 0 and the resource usage figures
were even lower than before, but the problem persisted:

I ran 'strace' on the apache processes and noticed that calls to clone() took at
least 30s:
  
  # strace -T $(ps auxw | awk '/sbin\/apache/{ print "-p " $2 }' | tr '\n' ' ') 2>&1 | grep clone

  [pid 20445] <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f812f53ba10) = 28685 <30.003128>
  [pid 26259] <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f812f53ba10) = 28686 <30.005437>
  [pid 26236] <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f812f53ba10) = 28687 <30.003419>
  [pid 26230] <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f812f53ba10) = 28688 <30.005405>
  [pid 26221] <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f812f53ba10) = 28689 <30.001986>
  [pid 26214] <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f812f53ba10) = 28690 <30.001550>
  [pid 26208] <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f812f53ba10) = 28691 <30.002959>
  [pid 25761] <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f812f53ba10) = 28693 <30.001823>
  [pid 25824] <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f812f53ba10) = 28692 <30.003443>
  ...
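
To quantify this over a longer capture, the durations that 'strace -T' prints
in angle brackets at the end of each line can be aggregated. The following is
a minimal sketch, assuming the output format shown above; 'summarize_clone'
is a hypothetical helper name.

```shell
#!/bin/sh
# summarize_clone: hypothetical helper. Reads 'strace -T' output on stdin and
# reports how many clone() calls completed, plus their min/max/avg duration.
summarize_clone() {
  awk '
    /clone/ {
      # Drop trailing whitespace so the duration is really at end of line.
      sub(/[ \t\r]+$/, "")
      # strace -T appends the time spent in the syscall as <seconds>.
      if (match($0, /<[0-9]+\.[0-9]+>$/)) {
        d = substr($0, RSTART + 1, RLENGTH - 2) + 0
        n++
        sum += d
        if (n == 1 || d < min) min = d
        if (n == 1 || d > max) max = d
      }
    }
    END {
      if (n) printf "calls=%d min=%.3f max=%.3f avg=%.3f\n", n, min, max, sum / n
      else   print "calls=0"
    }
  '
}

# Intended use (the summary is printed once strace exits or is interrupted):
#   strace -T -p PID 2>&1 | summarize_clone
```

Lines such as "clone(... <unfinished ...>" carry no duration and are skipped,
so only completed calls are counted.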

Simple HTTP requests (the default page on an unmanaged vhost, for example) took
slightly more than 30s, while still getting the correct answer.


Some more details:

- This problem:
    - has happened several times, on different machines
    - is not reproducible (at least, I do not know how to trigger it)
    - happens on *all* frontends of a cluster at the *same* moment
      (confirmed by the monitoring system, within +/- 5 minutes)
- No other applications/processes are in bad shape (so it does not look like a
  system-wide deadlock)
- Nothing relevant in the logs (neither kernel nor Apache)


Also, I found a report of a similar issue:
  http://serverfault.com/questions/305544/60-second-php-mail-delay-through-browser-apache-but-no-delay-through-comman


Has anybody here encountered such an issue? Should I open a bug report? If so,
what should I provide the devs with?


I am considering collecting core dumps of the Apache processes. Should I use
gcore or kill -6?
Would core dumps be useful at all, considering that I have no C debugging
skills and that there was no segfault?


Thank you,
Best regards,

-- 
Sébastien Bocahu
